UNIVERSITY  OF  ILLINOIS  - URBANA,  ILLINOIS 



REPORT R-796 OCTOBER, 1977


UILU-ENG 77-2243


COORDINATED  SCIENCE  LABORATORY 


ARCHITECTURE  FOR 
MULTIPLE  INSTRUCTION 
STREAM  LSI  PROCESSORS 


UNCLASSIFIED 

SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

REPORT DOCUMENTATION PAGE          READ INSTRUCTIONS BEFORE COMPLETING FORM

1. REPORT NUMBER     2. GOVT ACCESSION NO.     3. RECIPIENT'S CATALOG NUMBER

4. TITLE (and Subtitle)
   ARCHITECTURE FOR MULTIPLE INSTRUCTION STREAM LSI PROCESSORS

5. TYPE OF REPORT & PERIOD COVERED
   Technical

6. PERFORMING ORG. REPORT NUMBER
   R-796; UILU-ENG-77-2243

7. AUTHOR(s)
   William Joseph Kaminsky, Jr.

8. CONTRACT OR GRANT NUMBER(s)
   DAAB-07-72-C-0259; MCS 73-03488 A01

9. PERFORMING ORGANIZATION NAME AND ADDRESS
   Coordinated Science Laboratory
   University of Illinois at Urbana-Champaign
   Urbana, Illinois 61801

10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS

11. CONTROLLING OFFICE NAME AND ADDRESS
    Joint Services Electronics Program

12. REPORT DATE
    October 1977

14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office)

15. SECURITY CLASS. (of this report)
    UNCLASSIFIED

15a. DECLASSIFICATION/DOWNGRADING SCHEDULE

16. DISTRIBUTION STATEMENT (of this Report)
    Approved for public release; distribution unlimited.

17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report)

19. KEY WORDS (Continue on reverse side if necessary and identify by block number)

Pipelining 
Precedence  Graph 
Multiprocess 


20. ABSTRACT (Continue on reverse side if necessary and identify by block number)

Advancements of the semiconductor industry have produced an exponential
increase in the capacity of integrated circuit chips. As this trend continues,
much  more  complex  systems  will  be  able  to  be  implemented  on  a single  chip. 

This  research  is  aimed  at  studying  computer  architectures  which  are  most 
suited  for  LSI  implementation.  A pipelined  multiprocessor  architecture  is 
derived  and  methodology  given  which  determines  a layout  for  such  a processor 
from  a given  instruction  set.  The  problem  of  partitioning  a processor  when 
it  is  too  large  for  one  chip  is  also  examined.  An  evaluation  of  the 




UILU-ENG  77-2243 


ARCHITECTURE  FOR  MULTIPLE  INSTRUCTION 
STREAM  LSI  PROCESSORS 

by 

William  Joseph  Kaminsky,  Jr. 


This work was supported in part by the Joint Services Electronics
Program (U.S. Army, U.S. Navy and U.S. Air Force) under Contract
DAAB-07-72-C-0259 and in part by the National Science Foundation under Grant
MCS 73-03488 A01.


Reproduction  in  whole  or  in  part  is  permitted  for  any  purpose 
of  the  United  States  Government. 


Approved  for  public  release.  Distribution  unlimited. 


ARCHITECTURE  FOR  MULTIPLE  INSTRUCTION  STREAM  LSI  PROCESSORS 


BY 

WILLIAM  JOSEPH  KAMINSKY,  JR. 

B.S.,  Purdue  University,  1973 
M.S.,  University  of  Illinois,  1974 


THESIS 

Submitted  in  partial  fulfillment  of  the  requirements 
for  the  degree  of  Doctor  of  Philosophy  in  Electrical  Engineering 
in  the  Graduate  College  of  the 
University  of  Illinois  at  Urbana-Champaign,  1977 


Thesis  Adviser:  Professor  Edward  S.  Davidson 


Urbana,  Illinois 


ARCHITECTURE FOR MULTIPLE INSTRUCTION STREAM LSI PROCESSORS

William  Joseph  Kaminsky,  Jr.,  Ph.D. 

Department  of  Electrical  Engineering 
University of Illinois at Urbana-Champaign, 1977

Advancements of the semiconductor industry have produced an
exponential increase in the capacity of integrated circuit chips. As this
trend continues, it will become possible to implement much more complex
systems on a single chip. This research is aimed at studying computer
architectures which are most suited for LSI implementation. A pipelined
multiprocessor architecture is derived and a methodology is given which
determines a layout for such a processor from a given instruction set. The
problem of partitioning a processor when it is too large for one chip is also
examined. The performance of the pipelined processor is evaluated against that
of a single stream processor. Three program traces are analyzed to study their
usage of registers, and this data is used to determine the performance of the
processors with various degrees of pipelining and numbers of data registers.




ACKNOWLEDGMENT 

The author wishes to extend his deepest gratitude to his advisor,
Professor Edward S. Davidson. Dr. Davidson's perceptive guidance and
generous patience, as well as his careful review of this manuscript, were
indispensable to this research.

The author also wishes to thank his colleagues for their contributions
to this research, namely, Balasubramanian Kumar and Dan Hammerstrom
for their programming help, along with Rob Budzinski, Joel Emer and Alan Grant
for their constructive comments and discussions with respect to this research.


The  typing  of  Mary  McMillen  and  the  support  of  the  CSL  staff  are 
also  acknowledged. 



TABLE OF CONTENTS

CHAPTER

1. INTRODUCTION
   1.1. Current State of the Art
   1.2. Future LSI Designs
   1.3. Research Outline

2. DESIGN METHODOLOGY FOR A PIPELINED LSI MULTIPROCESSOR
   2.1. Processor Definition
   2.2. Basic Assumptions
   2.3. Functional Components
   2.4. Instruction Precedence Graph
   2.5. Defining the Minimum Set of Resources
   2.6. Modified Precedence Graph
   2.7. The Time Assignment Problem
   2.8. Processor Layout
   2.9. Conclusion

3. PARTITIONING
   3.1. The Partitioning Problem
   3.2. Approaches
   3.3. Bit Slice Analysis
   3.4. Connection Graph Model
   3.5. Register Slice Analysis
   3.6. Summary

4. EFFECTS OF REGISTER SET SIZE ON PERFORMANCE
   4.1. Performance Trade-offs Between a Single Stream Processor
        and a Pipelined Processor


LIST OF FIGURES

Figure

1.1. Set of Several Independent Processors in One Chip
2.1. Functional Components and Instruction Precedence Graph
     for the "ADD" Instruction
2.2. Register Transfer Diagram
2.3. Modified Register Transfer Diagram
2.4. Modified Precedence Graph
2.5. Merged Precedence Graph
2.6. Resultant Merged Precedence Graph
2.7. Final Layout
3.1. A Register Partition
3.2. A Bit Slice Partition
3.3. A Time Slice Partition
3.4. Connection Graph of Processor
3.5. Extended Connection Graph
3.6. Simplified Extended Connection Graph
3.7. Register Partition of Example Processor
3.8. Partition of a Special Four Bus Chip
4.1. Plot of Register Faults/Memory Reference vs.
     Number of Data Registers
4.2. Constant Performance Plot for "GAUS"
4.3. Constant Performance Plot for "EIGEN"
4.4. Constant Performance Plot for "LIST"
A.1. Instruction Precedence Graphs I
A.2. Instruction Precedence Graphs II
A.3. Instruction Precedence Graphs III
E.1. Graphical Model of Processor




CHAPTER  1.  INTRODUCTION 

1.1. Current State of the Art

Advancements made by the semiconductor industry have greatly increased
the complexity of a single integrated circuit chip. Over the nearly 20 year
history of the industry there has been an exponential increase in the complexity
of integrated circuits [NOY76] and it is expected that this trend will continue.
Accompanying this increase in chip complexity is a reduction in cost per
gate of digital systems implemented with LSI. One of the most dramatic recent
achievements of the industry is the implementation of small CPU's, called
microprocessors. As this trend of increasing chip size and density continues,
it will enable larger processors with greater capabilities to be implemented
on a single chip. Not only will processors become more complex, but
semiconductor memories are also getting faster, larger and cheaper. Entire
computing systems will become smaller, more powerful and cheaper, enabling
the use of more powerful computers in applications where they are now infeasible.
In order to produce efficient computers with large scale integrated circuits,
the characteristics of large scale integration must be studied. The purpose
of this research is to examine the architectures suitable for LSI
implementations of future processors.

The objective of future LSI processor design is to use the increasing
chip area available to increase performance as much as possible. To obtain
maximum performance from future LSI processors, the factors affecting cost of
LSI processors must be studied and suitable designs determined. Many present
microprocessors were adapted for LSI from existing discrete computer designs.
These discrete designs in turn were originally designed to reduce gate count
and based their processor cycles upon core memory cycles. Relevant factors
such as pin count and interconnection patterns were considered only secondarily.
Future LSI designs must be based on other criteria.

The two main cost factors for LSI implementation are pin count and
chip area. With exponentially increasing chip area, chip area capacity no
longer limits the number of components as seriously as the number of pins
allowed for connecting the internal components to the rest of the system.
Although chips with more pins are being produced, physical constraints such as
board area, practical chip package size and packaging costs do not allow a
rapid increase of pins.

Large  numbers  of  pins  contribute  to  other  costs  as  well.  Area  costs 
are  increased  by  large  numbers  of  pins  since  each  pin  requires  a pin  driver 
which uses a significant amount of area due to its need to handle higher
currents  and  voltages  than  internal  devices.  Connection  pads  are  needed  as 
well.  A greater  number  of  pins  increases  system  costs  since  the  number  of  PC 
board  connections  and  wiring  interconnections  tends  to  increase  with  more  pins. 
Reliability  is  reduced  also  since  many  chip  failures  are  due  to  pin  faults 
[KET76]  and  many  system  failures  result  from  broken  PC  board  connections.  It 
is  therefore  important  to  reduce  the  number  of  pins  to  obtain  a cost  effective 
and  reliable  system.  This  goal  should  be  kept  in  mind  from  the  inception  of 
the  design  process. 

The other cost factor, chip area, primarily affects the yield of
the fabrication process. Yields drop precipitously as the area exceeds some
practical limit. The practical area limit is increasing, but efficient use
of the area allowed is important to obtain maximum cost effectiveness.
On-chip interconnections, in many cases, compose as much as 67 percent of the
total chip area [PEN72], especially for "random logic" implementations, which
comprise a significant portion of microprocessor layout. Reduction of on-chip
interconnection area will increase the number of active devices possible
within a chip. Future designs should strive for regular structure since this
leads to less interconnection area and a higher component/cost ratio.

1.2. Future LSI Designs

There are several ways of utilizing additional chip area to increase
the system performance of future LSI processors. One way is to enlarge the
instruction repertoire, which uses additional area for instruction decoding
and control signal generation. This reduces the number of instructions needed
to execute a program and may affect the average instruction execution time.
More complex computational capabilities, such as multiply units or floating
point units, could be added to reduce the time for execution of numerical
operations. The register set could be increased to reduce the number of
memory references required and thereby reduce execution time. Multiple
processors could be implemented on one chip to process multiple programs at
one time. Pipelining could be used to increase throughput of the processor or
as a way of achieving multiprocessing with shared hardware instead of using
completely separate processors.

The  two  objectives  for  LSI  designs,  few  pins  and  efficient  use  of 
chip  area,  must  be  considered  in  choosing  which  methods  of  improving  processor 
performance  are  used.  Reducing  pins  is  probably  the  most  important  objective. 
This  research  is  aimed  primarily  at  this  objective. 

First, consider the pin usage in a typical microprocessor. A typical
microprocessor usually has cycles composed of several phases. During one
phase an operation or data transfer is accomplished. If there needs to be
information flow into or out of the CPU, a pin or set of pins will be used to
transfer this information. Usually each pin is dedicated to one signal path.
In some cases, pins may be shared by mutually exclusive signals when there is
no ambiguity as to which signal is being transferred. Many of these pins,
such as memory address or data pins, will be used only once during an entire
cycle and remain idle during the rest of the cycle. Idle pin time of a single
processor chip cannot be reduced greatly in some cases since more pins are
required during some phases than during other phases and utilization of all
pins during each phase cannot be accomplished. However, if the pins are
utilized well on a chip, the chip is likely to require fewer pins. Let the
pin utilization $\eta_p$ of pin $p$ be defined as follows.

$$\eta_p = \frac{\text{average number of phases per cycle during which pin } p \text{ is used}}{\text{total number of phases per cycle}} \qquad (1.1)$$

Let the average for the chip be $\eta_P$, where $n$ = number of pins and

$$\eta_P = \frac{1}{n} \sum_{p} \eta_p. \qquad (1.2)$$

For any particular design $\eta_P$ should be as close to one as possible.
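
As a rough illustration of the pin utilization just defined and of its chip-wide average (Equation 1.2), the following Python sketch computes both from a table giving the number of phases in which each pin is active. The pin names, the four-phase cycle and the function names are hypothetical, chosen only for illustration; they are not taken from any particular microprocessor.

```python
from fractions import Fraction


def pin_utilization(phases_used: int, phases_per_cycle: int) -> Fraction:
    """Fraction of the phases in one cycle during which a single pin is used."""
    return Fraction(phases_used, phases_per_cycle)


def chip_utilization(usage: dict[str, int], phases_per_cycle: int) -> Fraction:
    """Average of the per-pin utilizations over all n pins of the chip."""
    n = len(usage)
    total = sum(pin_utilization(u, phases_per_cycle) for u in usage.values())
    return total / n


# Hypothetical 4-phase cycle: each pin is busy during only one or two phases.
usage = {"addr0": 1, "addr1": 1, "data0": 2, "read_write": 1}
print(chip_utilization(usage, phases_per_cycle=4))
```

With these assumed figures the chip average is well below one, which is the situation the text describes for a single processor whose pins sit idle for most of each cycle.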

One way to obtain high performance and high pin utilization is to
use several independent processors on one chip as mentioned previously. A
chip containing N processors operating independently, but shifted in time so
that the pins could be shared using a multiplexer/demultiplexer, is shown in
Figure 1.1. Here the processors would have identical cycling successively
shifted in time from processor to processor by one phase. Furthermore, for
each pin, each processor would only have access to the pin during a particular
phase of each cycle. Each pin always transfers a particular signal, but for
successive processors in successive phases. For this system, the pin
efficiency will be close to one and the performance will be approximately
N times that of one of the processors (provided interference between accesses
to memory or I/O devices is not significant and all processors can be kept
busy).
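
The time-shifted sharing of pins can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that the cycle has exactly N phases, that processor i runs the common cycle delayed by i phases, and that every processor needs pin group k only during its own local phase k; the function names are hypothetical.

```python
def local_phase(processor: int, phase: int, n: int) -> int:
    """Phase of its own cycle that a processor is in at a given global phase."""
    return (phase - processor) % n


def pin_owner(pin_group: int, phase: int, n: int) -> int:
    """Processor that has access to pin_group during the given global phase."""
    return (phase - pin_group) % n


n = 4
for phase in range(n):
    for group in range(n):
        # Every processor whose local phase calls for this pin group right now.
        claimants = [i for i in range(n) if local_phase(i, phase, n) == group]
        # Exactly one claimant: the pin group is never idle and never contended.
        assert claimants == [pin_owner(group, phase, n)]
print("every pin group is used by exactly one processor in every phase")
```

Under these assumptions each pin group is busy in every phase, so its utilization is one, compared with 1/N for the same pin group on a single processor.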


There  are  several  problems  with  this  scheme.  Processors  must  be 
timed  synchronously  with  respect  to  each  other.  It  would  require  a very  large 
chip  area  to  hold  N complete  processors  plus  the  multiplexer/demultiplexer 
circuitry.  This  design  would  also  require  long  interconnections  from  each 
processor  to  the  one  multiplexer/demultiplexer  circuit.  One  way  to  reduce 
the  area  required  by  this  architecture  is  to  share  much  of  the  hardware  units 
and  interconnections.  This  sharing  could  be  accomplished  by  a pipelined 
multiprocessor  design. 

A pipelined multiprocessor which processes multiple instruction
streams within the pipe (instead of one instruction stream of which several
instructions are in execution) would achieve the same pin utilization as the
above set of separate processors. Each segment of such a pipe would execute
one phase of one process and control the set of pins associated with that
segment of the pipe. The pipelined processor would, instead of having N
copies of each hardware unit, only have as many copies as are used during one
cycle of execution and share them amongst all processes. Since the pipelined
architecture employs a uniform flow of data through the pipe, it requires a
highly regular structure to be implemented on a chip, thereby reducing the
interconnection area required and providing automatic timing of all processes
in the pipe. Since pins are connected directly to one segment of the pipe,
no multiplexer/demultiplexer hardware is required. This pipelined architecture
combines increased performance, efficient use of chip area and high pin
utilization. Since this pipelined approach has many of the characteristics
suitable for LSI implementation, this research is aimed at developing this
architecture. The pipelined architecture performance is also compared with
multiple data register processors since the comparison is straightforward and
the multiple data register approach is used in some microprocessors and many
minicomputers to improve performance.
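
The hardware saving claimed for the pipe can be illustrated with a small count. The sketch below assumes one interpretation of the text: with N separate processors each hardware unit is replicated N times, whereas the pipe needs one copy of a unit in each segment (phase) that uses it. The unit names and the phase assignments are hypothetical.

```python
def unit_copies(
    phases_using_unit: dict[str, list[int]], n_processors: int
) -> dict[str, tuple[int, int]]:
    """Per unit: (copies with N separate processors, copies in the pipe)."""
    return {
        unit: (n_processors, len(set(phases)))
        for unit, phases in phases_using_unit.items()
    }


# Hypothetical 4-phase cycle: the phases (pipe segments) that need each unit.
usage = {"adder": [2], "incrementer": [0, 3], "shifter": [2]}
for unit, (separate, pipelined) in unit_copies(usage, n_processors=4).items():
    print(f"{unit}: {separate} copies separately, {pipelined} in the pipe")
```

In this assumed example the pipe needs four unit copies in total where four separate processors would need twelve, while still serving four instruction streams.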

1.3.  Research  Outline 

Chapter 2 develops a design methodology for the pipelined LSI
multiprocessor introduced in the previous section. This methodology begins by
using a given instruction set to define the basic operations of the processor.
The precedence of operations for each instruction is determined and used to
define a shared set of resources. Timing of these resources is analyzed and
merges are found to reduce still further the number of interconnections and
hardware units. The physical structure is then composed from a merged resource
utilization graph.

Chapter  3 studies  the  pin  problem  when  a partition  of  the  system 
into  several  chips  is  necessary.  Three  possible  methods  are  described  of 
which  two  are  chosen  (a  bit  slice  partition  and  register  partition)  for  further 
modeling  and  comparison  of  effectiveness.  The  advantages  and  disadvantages 
of  each  partitioning  scheme  and  a combination  of  the  two  schemes  are  summarized. 

Chapter  4 studies  the  performance  of  the  pipelined  processor  vs.  a 
multiple  data  register  processor.  This  performance  study  is  based  on  a memory 
port  limited  processor,  since  pin  considerations  limit  the  chip  to  one  memory 
port.  The  register  usage  of  programs  run  on  an  IBM  360  is  analyzed  to 
determine  the  dependence  of  performance  on  the  register  set  size.  This  data 
is  then  used  to  compute  the  performance  of  the  pipelined  processors  with 
various  degrees  of  pipelining  and  numbers  of  data  registers. 



CHAPTER 2. DESIGN METHODOLOGY FOR A PIPELINED LSI MULTIPROCESSOR

2.1.  Processor  Definition 

The first step toward obtaining a pipelined organization is to define
the machine in terms of the functions it must perform as derived from a given
instruction set and register set. The instruction set is divided into two
parts: the user instructions, which are those needed by the user to perform
his computational requirements (such as add, move, increment, etc.), and the
system instructions, which are those needed by the system to perform operating
system tasks and other controlling operations (such as interrupt handling
instructions, I/O control, etc.). The user instruction set should be selected
to achieve efficient execution of the types of programs to be run on the
processor. These instructions need to assume certain details of the processor,
such as addressing methods and the size and nature of the addressable register
set, but should be defined in as machine-independent a manner as possible. The
system instructions must also be given and are used to carry out additional
operations which must be performed by the processor. The architecture is
then tailored to best execute these two sets of instructions. It is also
helpful to be given estimated instruction frequencies for the set of
instructions, since it is important to optimize execution of highly used
instructions.

Along with the instruction set, a set of registers is given which
are explicitly referenced by the instructions (data registers, index registers,
etc.). Next, an implicit set of registers, which are required by the processor
to execute the instruction set and to perform any operations, is derived
(memory address and data registers, program counter, etc.). These two sets
of registers must be implemented in the processor either as physical registers
actually contained in the CPU, or as virtual registers which are kept in
memory, but easily accessed by the CPU.

The  instruction  set  and  the  set  of  registers  defined  above  are  used 
to  determine  the  physical  structure  of  the  pipelined  CPU.  This  structure 
consists  of  functional  units,  registers  and  communication  paths.  The  following 
sections  describe  the  steps  involved  to  realize  the  final  layout  for  LSI 
implementation. 

To  aid  the  explanation  of  the  procedures,  an  example  will  be  carried 
through  this  chapter  illustrating  the  steps  required.  The  example  is  for 
illustrative  purposes  only  and  is  not  intended  as  a general  result  to  be 
actually  implemented  in  practice. 

The  example  chosen  utilizes  a subset  of  the  instruction  set  of  the 
HP-2100  minicomputer  [HP].  Table  2.1  lists  the  instructions  used  in  the 
example.  The  instructions  for  I/O  control  have  been  omitted  since  these  are 
specifically  for  the  I/O  system  of  the  HP  computer.  The  I/O  control  would  be 
implemented  in  the  control  section  of  the  pipelined  processor,  which  is  beyond 
the  scope  of  this  research.  Only  the  processing  section  is  investigated  in 
this  research. 

The register set chosen to be used in the example consists of a
Memory Address Register (MA), a Memory Data Register (MD), a Program Counter
(PC), an Instruction Register (IR), a Stack Pointer (SP), and an Accumulator
(A). Only one accumulator was chosen since a minimal register set is desired
and the instruction set does not require any specific quantity of registers
(although the actual HP-2100 implementation has two, called the A and B
registers). The SP register has been added to facilitate the use of recursive
procedures.


Table 2.1. Instructions Used in Example

Mnemonic     Operation

CLR          CLEAR A
(illegible)  CLEAR STATUS BIT N, Z OR C
(illegible)  SET STATUS BIT N, Z OR C
DEC          DECREMENT A
INC          INCREMENT A
COMP         COMPLEMENT A
NEG          NEGATE A
TEST         SET STATUS BITS ACCORDING TO CONTENTS OF A
ASL          ARITHMETIC SHIFT A LEFT
ASR          ARITHMETIC SHIFT A RIGHT
ROL          ROTATE A LEFT
ROR          ROTATE A RIGHT
LOAD         LOAD A FROM MEMORY
STORE        STORE A TO MEMORY
ADD          ADD MEMORY LOCATION TO A (RESULT IN A)
SUB          SUB MEMORY LOCATION FROM A (RESULT IN A)
XOR          EXCLUSIVE OR A AND MEMORY LOCATION (RESULT IN A)
IOR          INCLUSIVE OR A AND MEMORY LOCATION (RESULT IN A)
AND          AND A AND MEMORY LOCATION (RESULT IN A)
(illegible)  CONDITIONAL JUMP IF N, Z OR C TRUE
JMP          UNCONDITIONAL JUMP
JSUB         JUMP TO SUBROUTINE
RSUB         RETURN FROM SUBROUTINE
WRIO         WRITE TO I/O DEVICE
RDIO         READ FROM I/O DEVICE




2.2.  Basic  Assumptions 

Several basic assumptions were made before proceeding to determine
the physical structure of the CPU. The first assumption deals with the control
section of the processor. The control section can be treated separately from
the processing section (that section which performs operations on and transfers
data) and is linked to the processing section by the necessary control and
sense lines. It is assumed that a control section exists which receives the
instruction and necessary sense signals and generates the control signals for
the processing section, external memory and I/O devices. Methods of
implementation of control applicable to this type of processor have been
described by Kogge [KOG77] and will not be pursued here.

The  second  assumption  is  that  the  processor  will  contain  only  one 
Memory  Address  port  and  one  Memory  Data  port.  If  two  or  more  ports  were  used, 
the  number  of  memory  pins  would  be  multiplied  by  the  number  of  ports  and  it 
would  be  difficult  to  obtain  full  utilization  of  these  pins.  A multiple  port 
memory  also  increases  the  complexity  and  cost  of  the  memory  and  busses  which 
is  likely  not  to  be  cost-effective. 

The third assumption is that I/O devices are treated as memory
locations and communications to them are made over the memory data and address
busses in a similar fashion to that used by the PDP-11 [DEC74]. This
assumption also reduces pins since, if a separate port were used, pins would
have to be added to the CPU chip and utilization would be very small for these
pins. Using the memory bus for I/O communications requires the loss of a memory
cycle each time an I/O transfer is made (if memory otherwise would be requested
during the same cycle as an I/O information transfer), but this should not
produce a significant degradation of performance since I/O devices run much
slower than the CPU. Any control signals generated by the CPU control section
and pins for I/O control (e.g. interrupt, flags, acknowledgment, etc.) would
be added to the CPU control section.

Special  purpose  hardware  devices  (e.g.  floating  point  unit,  multiply, 
etc.)  could  be  connected  in  the  same  fashion  as  I/O  devices.  These  communicate 
with  the  CPU  over  the  memory  busses.  This  assumption  maintains  high  pin 
utilization  and  minimal  pins,  although  it  also  eliminates  a possible  memory 
reference  for  each  special  device  transfer  cycle. 

2.3.  Functional  Components 

Given the instruction set and register set, a method is needed for
describing each instruction in terms of which resources (a resource is any
communication path, register or functional unit) are required, what operations
must be performed and in what order they are performed. To do this each
instruction will be decomposed into a set of functional components. These
functional components are register-to-register transfers, functional operations
and memory accesses.

A register-to-register  transfer  involves  only  a read  from  one 
register  of  its  contents  and  a write  of  this  information  into  another  register. 

This  functional  component  involves  the  use  of  two  registers  and  a connection 
path  between  them. 

The second type of component involves a read from one or more
registers to a functional unit, which performs the necessary operation. (A
functional unit is a hardware unit which performs a functional operation such
as add, sub, exclusive or, shift, clear, etc.) The result is then written to
the result register. Each functional unit is specified by the types of
operations it actually performs and is not given a general name such as ALU.
This specification defines exactly the type of device needed, which minimizes
the complexity of the functional unit for each unit required.

This  minimal  complexity  description  of  each  functional  unit  enables 
distribution  of  appropriate  functional  units  throughout  the  pipeline.  This 
is  useful  for  two  reasons.  Multiple  units  allow  operations  to  be  performed 
in  parallel,  thereby  speeding  computation.  Secondly,  distribution  of  the 
functional  units  reduces  multiple  communication  paths  to  one  unit,  thereby 
reducing  the  required  interconnection  area.  In  fact,  in  some  cases,  the  added 
functional  units  may  require  less  area  than  the  interconnections  would  require 
with fewer units. The resources needed for each functional component of the
second type are the input data register(s), the communication path(s) from
the input register(s) to the functional unit, the functional unit, the
communication path from the functional unit to the result register and the
result register.

The third functional component, the memory access, first consists
of a transfer of the address from the MA register to the memory address bus.
Then, at the appropriate time, the data is written from the memory data bus
to the MD register for a read operation, or to the memory data bus
from the MD register for a write operation. When describing this functional
component the memory address and data busses are treated as registers.
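
The three kinds of functional component, and the resources each one ties up, can be written down as a small data model. The following Python sketch is only an illustration under stated assumptions: the class names, the "X->Y" naming of communication paths, the bus names ABUS and DBUS, and the choice of A and MD as the operands of an add are all hypothetical and are not notation used in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    """Register-to-register transfer: two registers and the path between them."""
    src: str
    dst: str

    def resources(self) -> set[str]:
        return {self.src, self.dst, f"{self.src}->{self.dst}"}


@dataclass(frozen=True)
class Operation:
    """Functional operation: input register(s), a functional unit, a result register."""
    inputs: tuple[str, ...]
    unit: str
    result: str

    def resources(self) -> set[str]:
        paths = {f"{r}->{self.unit}" for r in self.inputs}
        paths.add(f"{self.unit}->{self.result}")
        return set(self.inputs) | {self.unit, self.result} | paths


@dataclass(frozen=True)
class MemoryAccess:
    """Memory access: MA drives the address bus; MD reads or drives the data bus."""
    write: bool

    def resources(self) -> set[str]:
        # The busses are treated as registers, as in the text.
        data_path = "MD->DBUS" if self.write else "DBUS->MD"
        return {"MA", "ABUS", "MA->ABUS", "MD", "DBUS", data_path}


add = Operation(inputs=("A", "MD"), unit="ADD", result="A")
print(sorted(add.resources()))
```

Representing each component by its resource set makes it straightforward to ask later which components of different instructions can share a register, path or functional unit.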

2.4.  Instruction  Precedence  Graph 

To describe each instruction, a precedence graph is formed which
indicates which functional components are needed and the partial order
in which they must be performed for that particular instruction. The
precedence graph is composed of a node representing each functional component
and directed arcs joining these nodes indicating the precedence of execution.
Any combination of functional components which may be performed in parallel is
indicated by parallel paths in the graph.

As an example, Figure 2.1 shows the precedence graph and the definition
of each node in the graph for the ADD instruction. This instruction adds
the contents of register A and the contents of the memory address specified
in the instruction and places the result in register A. By examining the
graph in Figure 2.1, one can see that nodes 1, 2 and 3 lie on a path parallel
to node 6, so node 6 can be executed in parallel with them. The same holds for
nodes 4 and 5 and nodes 7 and 8. However, node 7 requires both nodes 3 and 6 to
be performed before it, whereas node 4 requires only node 3 to be performed
before it, as indicated by the directed arcs. Once nodes 5 and 8 have been
completed, this instruction is finished and the next instruction may begin.
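
The precedence relationships just described can be written down and checked mechanically. The following is a minimal sketch, assuming the arcs 1 → 2 → 3 → 4 → 5, 6 → 7 → 8 and 3 → 7 as read from the description of Figure 2.1; the names PREDECESSORS, earliest_levels and parallel_groups are illustrative and do not come from the report.

```python
# Precedence arcs for the ADD instruction, as read from the text describing
# Figure 2.1 (an assumed reading): 1 -> 2 -> 3 -> 4 -> 5, 6 -> 7 -> 8, 3 -> 7.
# Each node maps to the list of nodes that must be performed before it.
PREDECESSORS = {
    1: [], 2: [1], 3: [2], 4: [3], 5: [4],
    6: [], 7: [3, 6], 8: [7],
}


def earliest_levels(preds):
    """Return {node: level}; level is the length of the longest predecessor chain."""
    levels = {}

    def level(node):
        if node not in levels:
            levels[node] = 1 + max((level(p) for p in preds[node]), default=-1)
        return levels[node]

    for node in preds:
        level(node)
    return levels


def parallel_groups(preds):
    """Group nodes by earliest level; nodes in one group have no arcs between them."""
    groups = {}
    for node, lvl in sorted(earliest_levels(preds).items()):
        groups.setdefault(lvl, []).append(node)
    return [groups[lvl] for lvl in sorted(groups)]


GROUPS = parallel_groups(PREDECESSORS)
for step, nodes in enumerate(GROUPS):
    print(step, nodes)
```

Node 6 is placed at the earliest step here, but since it has no predecessors it could equally be performed alongside node 2 or node 3; the grouping shows only one of the orderings the partial order permits.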

A precedence graph for each instruction must be formed before proceeding
to the next step. Appendix A contains the precedence graphs for all
the HP-2100 instructions shown in Table 2.1.

The precedence graph for each instruction begins when data comes
from memory via the data bus and is transferred from the memory data bus to
the IR and MD. In fact, this is also used as the beginning of the first
processor cycle for each instruction. The cycle time for the pipelined processor
is the time it takes a process to pass through the entire pipeline and
return to the beginning of the pipe.

Since the processor has only one memory port, as described in
Chapter 1, only one memory reference can be made by a process during one
processor cycle. Although instructions are commonly thought of as starting
with the fetch (after PC → MA), no execution of the fetched instruction can be
performed prior to receiving the instruction from memory. The completion of
the previous instruction could be overlapped with the fetch, provided generation
of the fetch does not require any resources needed to finish the previous
instruction. The memory data bus to IR and MD reference point is chosen only
as a convenient point to start each instruction cycle, since overlap across
this point is not likely to be possible. However, this choice does not prohibit
overlap at this point between instructions.

Add: Reg A + (MEM Address) to Reg A

1)  MEM_DATA → IR, MD
2)  MD_ADDRESS → MA
3)  MA → MEM_ADDRESS
4)  MEM_DATA → MD
5)  MD + A → A
6)  PC + 1 → PC
7)  PC → MA
8)  MA → MEM_ADDRESS

Figure 2.1. Functional Components and Instruction Precedence Graph for
the "add" Instruction.

Since any particular cycle usually begins with the functional
component MEM_DATA → MD, the remaining functional components can be executed
at various times within a cycle, possibly some in parallel, depending upon the
precedence relationships, the number of resources and their position in the
pipe. The time assignment problem is to determine the position in time at
which each of these functional components is performed. This assignment must
be made taking into consideration the resource assignment problem, namely,
determining which communication paths are required, the types of functional
units and the number required of each type. The objective of both of these
assignments is to strike a balance between the minimization of execution time
and the minimization of resources.

The  run  time  of  a program  is  significantly  influenced  by  the  number 
of  memory  references  when  it  is  run  on  a pipelined  processor  with  one  memory 
port.  The  performance  can  be  evaluated  by  the  number  of  memory  references 
and  the  time  between  them.  If  the  architecture  is  to  be  changed  to  increase 
the  performance,  either  the  time  between  references  must  be  reduced  or  the 
number  of  memory  references  required  must  be  reduced.  When  evaluating  the 
time  assignment  problem,  the  critical  path  (the  path  of  the  precedence  graph 
which takes the longest time) tends to be the memory address generation and
memory response time. The functional components necessary to generate the
memory address should be executed as early as possible in the cycle. The rest
of the functional components should be fitted into the cycle so that distinct
resources are minimized, the cycle time is not increased, and instructions
for the most part do not require additional cycles for execution. When the
memory address generation and memory response time form the critical path for
an instruction cycle, there should not be any difficulty in assigning the
remainder of the functional components to time slots. Problems arise only in
minimizing resources. The following two sections determine first the
minimal set of resources, then a practical method for time assignment of
resources.
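
The notion of a critical path can be illustrated on the ADD precedence graph. The sketch below assumes the arcs described for Figure 2.1 and assigns each node a purely hypothetical duration (the report gives no timings); node 4, which waits for the memory response, is arbitrarily made the slowest. The names DURATION and critical_path are illustrative.

```python
# Arcs as read from the description of Figure 2.1 (assumed reading).
PREDECESSORS = {
    1: [], 2: [1], 3: [2], 4: [3], 5: [4],
    6: [], 7: [3, 6], 8: [7],
}

# Hypothetical durations in arbitrary time units; NOT taken from the report.
# Node 4 (MEM_DATA -> MD) is given a long time to stand in for memory response.
DURATION = {1: 1, 2: 1, 3: 1, 4: 3, 5: 1, 6: 1, 7: 1, 8: 1}


def critical_path(preds, duration):
    """Return (length, path) of the longest-time path through the graph."""
    finish = {}      # earliest finish time of each node
    best_pred = {}   # predecessor that determines each node's start time

    def finish_time(node):
        if node not in finish:
            start, best_pred[node] = 0, None
            for p in preds[node]:
                if finish_time(p) > start:
                    start, best_pred[node] = finish_time(p), p
            finish[node] = start + duration[node]
        return finish[node]

    last = max(preds, key=finish_time)
    path, node = [], last
    while node is not None:
        path.append(node)
        node = best_pred[node]
    return finish[last], path[::-1]


LENGTH, PATH = critical_path(PREDECESSORS, DURATION)
print(LENGTH, PATH)
```

With these assumed durations the address generation and memory response chain (nodes 1 through 5) determines the cycle, and nodes 6 to 8 have slack in which they can be placed to suit the resource assignment.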

2.5.  Defining  the  Minimum  Set  of  Resources 

Once  the  precedence  graph  for  each  instruction  has  been  developed, 
the  resources  required  for  the  processor  can  be  determined.  The  resources, 
as  defined  previously,  consist  of  registers,  functional  units  and  communication 
paths.  Each  functional  component  of  the  precedence  graph  of  each  instruction 
requires a set of resources. A single register transfer graph can be composed
from the set of instruction precedence graphs by representing each register
and each functional unit by a node and each data communication path by a
directed arc between the corresponding register or functional unit nodes.
Separate nodes are used for functional units which send their results to
different registers. This rule results in multiple copies of the same type
of  functional  unit  such  as  the  multiple  increment  units  used  by  the  PC,  SP 
and  A registers  in  Figure  2.2.  Multiple  functional  units  are  used  to  reduce 
the number of communication paths. Using only one functional unit of each
type would require connection paths from each register which uses the functional
unit, thereby increasing the interconnection area required. This
restriction  represents  a significant  departure  from  the  traditional  register- 
bus-ALU  type  of  organization.  The  departure  is  motivated  by  an  appreciation 
of  LSI  technology  in  which  replication  of  simple  functional  units  should  take 
less  area  than  the  interconnection  area  required  to  share  a single  copy  of 
each  type  of  functional  unit.  Also,  more  than  one  use  of  any  single  functional 
unit  during  any  one  cycle  would  cause  an  additional  copy  of  this  functional 
unit  anyway  unless  additional  cycles  were  used  to  execute  the  instruction.  A 
register  transfer  graph  for  the  example  is  shown  in  Figure  2.2. 

This graph was composed using the following method. First, a node is
used to represent each register in the processor and one for memory. It is
helpful to place together nodes which represent registers which are likely to
have many data communication paths between them. I/O devices or channels may
also be represented with nodes, but in this example I/O is handled as a subset
of memory and is included in the memory node. Second, the instruction precedence
graphs are examined one by one and, for each functional component, the
necessary data communication paths and functional units are determined. If a
data communication path is not yet represented in the register transfer graph,
an arc is added between the corresponding register nodes or functional unit
nodes to represent that data transfer path. If a node does not yet exist to
represent a functional unit required by the functional component, a node is
added for this functional unit and arcs are added from the input registers to
the functional unit and from the functional unit to the result register. After
all instruction precedence graphs have been examined, the register transfer
graph is complete.
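
The construction method above can be sketched in a few lines. This is an illustration only, under several assumptions: each functional component is encoded as a tuple (input registers, functional unit type or None, result register), the memory busses are treated as registers, the ADD components are those legible in Figure 2.1, and the second instruction is hypothetical and included only to show the rule that a functional unit is replicated per result register.

```python
# Functional components 2-8 of the ADD instruction (Figure 2.1), encoded as
# (input registers, functional unit type or None, result register).
ADD_COMPONENTS = [
    (("MD",), None, "MA"),          # address field of MD -> MA
    (("MA",), None, "MEM_ADDR"),    # MA -> memory address bus
    (("MEM_DATA",), None, "MD"),    # memory data bus -> MD
    (("MD", "A"), "ADD", "A"),      # MD + A -> A
    (("PC",), "INC", "PC"),         # PC + 1 -> PC
    (("PC",), None, "MA"),          # PC -> MA
    (("MA",), None, "MEM_ADDR"),    # MA -> memory address bus (next fetch)
]

INSTRUCTIONS = {
    "ADD": ADD_COMPONENTS,
    # Hypothetical instruction, not from the report: increments register A.
    "INC_A": [(("A",), "INC", "A")],
}


def build_register_transfer_graph(instructions):
    """Return (nodes, arcs) accumulated over all instruction components."""
    nodes, arcs = set(), set()
    for components in instructions.values():
        for inputs, unit, result in components:
            nodes.update(inputs)
            nodes.add(result)
            if unit is None:
                # Plain data transfer: one arc per source register.
                arcs.update((src, result) for src in inputs)
            else:
                # One functional unit node per (unit type, result register).
                unit_node = f"{unit}->{result}"
                nodes.add(unit_node)
                arcs.update((src, unit_node) for src in inputs)
                arcs.add((unit_node, result))
    return nodes, arcs


NODES, ARCS = build_register_transfer_graph(INSTRUCTIONS)
print(sorted(NODES))
print(len(ARCS), "communication paths")
```

Because nodes and arcs are kept in sets, a path or unit needed by several instructions is recorded only once, which is the "add only if not yet represented" step of the method.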




This graph shows a minimum set of interconnection paths between
registers and functional units (at the expense of using extra functional units
of some types). Using this graph, changes in the complexity of interconnections
can be studied as changes of the architecture or instruction set are made.
The graph is also useful for determining an effective partitioning of the CPU
into several chips. If a partition is desired, groupings of registers and
functional units with few interconnection paths between the groups are required
to minimize pins. This graph can be used to identify highly connected
registers, although a different version of this graph is used in Chapter 3 so
that partitioning can be studied more directly.
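
As a rough illustration of how the graph supports a partitioning decision, the sketch below lists the communication paths that cross chip boundaries for a candidate assignment of nodes to chips. The arc set, node names and chip assignments are hypothetical, and the number of pins each crossing path would need (which depends on path width) is not modeled.

```python
# A small hypothetical register transfer graph (not Figure 2.2).
ARCS = {
    ("PC", "MA"), ("MD", "MA"), ("MA", "MEM_ADDR"), ("MEM_DATA", "MD"),
    ("MD", "ALU"), ("A", "ALU"), ("ALU", "A"),
}


def crossing_paths(arcs, chip_of):
    """Return the arcs whose source and destination lie on different chips."""
    return sorted((s, d) for s, d in arcs if chip_of[s] != chip_of[d])


# Candidate 1: addressing registers on chip 0, data registers and ALU on chip 1.
BY_FUNCTION = {"PC": 0, "MA": 0, "MEM_ADDR": 0,
               "MD": 1, "A": 1, "ALU": 1, "MEM_DATA": 1}

# Candidate 2: same, except that the ALU is moved to chip 0.
ALU_MOVED = dict(BY_FUNCTION, ALU=0)

print(crossing_paths(ARCS, BY_FUNCTION))
print(crossing_paths(ARCS, ALU_MOVED))
```

The first candidate cuts a single path while the second cuts four, so under this simple count the grouping that keeps highly connected nodes together is preferred.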

Changes  of  architecture  and  organization  can  be  evaluated  with  the 
graph  by  determining  the  addition  or  reduction  of  data  communication  paths  for 
each  change.  Having  a register  in  the  processor  can  be  traded  off  against 
using  a memory  location  to  store  its  contents  and  adding  a cycle  for  the 
memory  reference.  This  is  done  by  changing  all  instruction  precedence  graphs 
to  correspond  to  the  change  and  redrawing  the  register  transfer  graph.  The 
effect  of  the  change  on  interconnections  can  be  observed  and  a decision  can 
be  made  to  accept  it  or  not. 

An example of this tradeoff for the instruction set used is shown
in Figure 2.3. Here the stack pointer register is removed and its contents
are stored in a fixed location of memory. This change results in the elimination
of a register, but adds two communication paths between the MD register
and the MA register. Also, the functional unit performing additions, subtractions
and logical operations now has three inputs instead of two. These
results show that removing this register increases connection complexity and
would likely be more costly than including the register itself. (A more
accurate cost estimation requires data about the technology used for
implementation.) A decision, therefore, was made to keep the stack pointer register
as a physical register in the processor. Other changes of architecture and
processor organization can be evaluated in a similar fashion to determine
their effects on connection requirements.

Figure 2.3. Modified Register Transfer Diagram.

Alternatives  to  communication  paths  which  are  costly  to  implement 
can  be  sought  using  the  register  transfer  graph.  The  instructions  that  use  a 
path  which  is  a candidate  for  removal  are  first  determined.  This  is  done  by 
labeling  the  arc  representing  each  communication  path  according  to  which 
instructions  use  it  when  composing  the  register  transfer  graph.  An  alternative 
for  using  this  path  is  found,  if  possible,  for  each  instruction  which  uses 
the  path.  A new  register  transfer  graph  is  drawn  for  this  change  and  the 
differences noted. In some cases an alternative cannot be found (such as for
the path from the PC register to the MD register in Figure 2.2). This path is
used only for the JUMP to SUBROUTINE instruction, but it cannot be eliminated
and so must remain.

Low usage paths can be found by determining the probability of
usage for each path. This is done by first composing a matrix called $R_{dp}$,
which has a column corresponding to each data communication path and a row
corresponding to each instruction. Each element represents the number of
times the communication path represented by that column is used by the
instruction represented by that row.

           A  B  ...  C  D
ADD        1  0  ...  1  1
OR         0  1  ...  1  1
JUMP       1  0  ...  0  0
STORE      0  1  ...  0  0
...
           1  0  ...  1  1

Next, the instruction mix vector, $P_I$, is composed with each element
equal to the probability of use of the instruction corresponding to its row.
(The frequency of usage for each instruction must be known to apply this
procedure.) Then the vector $P_{dp}$, the data communication path probability
vector, is computed as $P_{dp} = R_{dp}^{T} P_I$. Here each element is
proportional to the frequency of usage for the corresponding communication
path. From this the degradation of performance that would result if the path
were removed can be found.
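
The path usage computation is a matrix-vector product and can be sketched directly. In the example below the usage matrix contains only the four labeled rows and the four labeled columns of the matrix shown above (the elided rows and columns are omitted), and the instruction mix is hypothetical, chosen only so that the probabilities sum to one.

```python
from fractions import Fraction

INSTRUCTION_NAMES = ["ADD", "OR", "JUMP", "STORE"]
PATH_NAMES = ["A", "B", "C", "D"]

# Rows are instructions, columns are data communication paths.
USAGE = [
    [1, 0, 1, 1],   # ADD
    [0, 1, 1, 1],   # OR
    [1, 0, 0, 0],   # JUMP
    [0, 1, 0, 0],   # STORE
]

# Hypothetical instruction mix (not from the report); sums to 1.
INSTRUCTION_MIX = [Fraction(2, 5), Fraction(1, 10), Fraction(1, 5), Fraction(3, 10)]


def path_usage(usage, mix):
    """Return the transpose of `usage` times `mix`, one entry per path."""
    n_paths = len(usage[0])
    return [sum(row[j] * p for row, p in zip(usage, mix)) for j in range(n_paths)]


PATH_USAGE = path_usage(USAGE, INSTRUCTION_MIX)
for name, value in zip(PATH_NAMES, PATH_USAGE):
    print(name, value)
```

Exact fractions are used so that the results can be compared without rounding; with a measured instruction mix, ordinary floating point values would serve equally well.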

Performance degradation is evaluated as follows. Let

$P_i$ = probability of instruction i,

$N_M(i)$ = number of memory references for instruction i,

$N_C(i)$ = number of strictly computation (non-memory ref.) cycles for
instruction i, and

$T$ = average time of execution per instruction,

where

$$T = \sum_i P_i \left[ N_M(i) + N_C(i) \right] \tag{2.1}$$

If the removal of a data communication path, j, increases the execution time
of those instructions which use it by one cycle, then, letting $b_{ij}$ be a
Boolean variable equal to one if instruction i uses path j, and zero otherwise,
we have

$$T' = \sum_i P_i \left[ N_M(i) + N_C(i) \right] + \sum_i P_i b_{ij} \tag{2.2}$$

where $T'$ is the new average instruction execution time. Performance
degradation, $\Delta T$, is then

$$\Delta T = T' - T = \sum_i P_i b_{ij} \tag{2.3}$$

For any path j that is removed, if the use of the instructions which use this
path is small, the performance degradation will be small.
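
A short numerical sketch of this evaluation follows. All numbers are hypothetical: the instruction mix, the cycle counts per instruction and the set of instructions using the removed path are invented for illustration, and the degradation is taken here simply as the difference between the new and old average times, which is an assumption about the intended form of the measure.

```python
from fractions import Fraction

# Hypothetical data for four instructions (not from the report).
PROB = [Fraction(2, 5), Fraction(1, 10), Fraction(1, 5), Fraction(3, 10)]
MEM_REFS = [2, 2, 1, 2]        # memory reference cycles per instruction
COMP_CYCLES = [0, 0, 0, 0]     # strictly computational cycles per instruction
USES_PATH = [1, 0, 1, 0]       # 1 if the instruction uses the removed path


def average_time(prob, mem_refs, comp_cycles):
    """Average cycles per instruction: sum of P_i * (N_M(i) + N_C(i))."""
    return sum(p * (m + c) for p, m, c in zip(prob, mem_refs, comp_cycles))


def degraded_time(prob, mem_refs, comp_cycles, uses_path):
    """Average cycles when every instruction using the path costs one more cycle."""
    extra = sum(p * b for p, b in zip(prob, uses_path))
    return average_time(prob, mem_refs, comp_cycles) + extra


T_OLD = average_time(PROB, MEM_REFS, COMP_CYCLES)
T_NEW = degraded_time(PROB, MEM_REFS, COMP_CYCLES, USES_PATH)
print(T_OLD, T_NEW, T_NEW - T_OLD)
```

The increase equals the total probability of the instructions that use the removed path, so a path used only by rare instructions costs little when it is removed.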


2.6.  Modified  Precedence  Graph 

The  register  transfer  graph  of  the  previous  section  gives  a minimal 
interconnection  pattern  but  does  not  reveal  which  paths  must  be  duplicated 
because  they  are  used  twice  during  a cycle  or  used  in  different  phases  of  a 
cycle.  For  a single  processor  the  register  transfer  graph  provides  all  the 
needed  information,  but  for  a pipelined  processor,  when  the  data  transfers 
occur  at  different  times  during  a cycle,  a separate  data  communication  path 
and/or functional unit is required. The first step toward developing a
timing graph of the processor is to develop precedence graphs for each
instruction which indicate specifically the resources used at each time.

To  provide  this  information  a modified  precedence  graph  for  each 
instruction  is  formed.  Definitions  for  the  modified  precedence  graph  are 
shown  in  Figure  2.4(a).  An  example  of  this  graph  is  shown  in  Figure  2.4(b) 
for  the  ADD  instruction  used  in  Figure  2.1.  The  precedence  graph  from  Figure 
2.1  is  shown  in  Figure  2.4(c).  The  modified  precedence  graph  exhibits  a one 
to  one  correspondence  between  its  groups  of  inputs,  outputs  and  functional 
operations  and  the  functional  components  of  the  previous  precedence  graph. 


Figure 2.4(a). Definitions of the modified precedence graph symbols (graphical
symbols not reproduced):

   Input or WRITE to Reg. J
   Output or READ from Reg. J
   Logical operation on data in J
   Logical operation on input to J

The  registers  involved  in  each  operation  and  data  transfer  are  indicated  as 
well  as  any  required  inputs  or  outputs  to  this  register.  Required  data 
communication  paths  can  easily  be  determined  since  output  and  input  registers 
are  specified.  The  use  of  functional  units  is  also  specified  in  the  modified 
precedence  graphs.  These  graphs  specify  the  resources  required  and  the 
precedence  of  their  use.  The  modified  precedence  graphs  can  now  be  used  to 
determine  the  time  assignments  of  all  resources  in  the  processor. 

2.7.  The  Time  Assignment  Problem 

In order to relate the timing of all the register input operations,
output operations, uses of data transfer paths and functional operations, a
merged precedence graph from all the modified precedence graphs is composed.
The objective of the merged precedence graph is to assign all operations and
data transfers to time periods of the processor cycle while using the same
resource (e.g. output of register MA) as few times as possible in order to
minimize input-output ports of internal registers, data communication paths
and functional units.

The merged precedence graph is composed in the following fashion.
First the topology of the graph is defined. The graph is to describe one
cycle of the processor. Time flows from left to right with the beginning of
the cycle at the leftmost position and the end of the cycle at the rightmost
position. Since the first objective of each cycle is to generate the memory
reference, the operations necessary to determine the next memory address are
done as soon as possible and all remaining operations are fitted into the cycle
to minimize resources, provided they are completed by the time that resource is
required by a new processor cycle.


The  graph  is  arranged  vertically  with  a row  for  each  register  and 
for  the  memory  data  and  address  ports.  Within  the  graph,  the  same  symbols 
defined  for  the  modified  instruction  precedence  graph  are  used  to  indicate 
register  input  and  output  operations  and  functional  operations.  However,  a 
functional  operation  done  along  with  a data  transfer  is  indicated  as  a circle 
on the input to the result register. Double-line arcs indicate a data transfer
and are drawn from the output register to the input register. Precedence arcs
are indicated by a single-line directed arc drawn from the input register of
the preceding double arc (or single register operation) to the output register
of the following double-line arc (or single register operation).

To construct the graph, first the left margin is labeled, indicating
a row for each register and memory bus (separate rows for $A^M$ and $D^M$).
Then, the modified precedence graph for each instruction is traced. Each
operation in the modified instruction precedence graph is entered on the merged
precedence graph in the same horizontal position as operations from other
instructions that are performed at the same time during the cycle. The accuracy
of the initial horizontal placement will not affect the resultant graph but it
will ease reading the initial merged precedence graph. The data transfer
operations (e.g. output from memory data bus and input to the MD register)
require the input and output resources to be placed in the same column and
connected with a double-lined arc oriented in the direction of flow of data.
Any precedence relationships are then indicated with a directed arc from the
preceding resource to the following resource.


A merged precedence graph has been constructed for the example
instruction set and appears in Figure 2.5. When composing this graph one
could also label the precedence and data transfer arcs with the instructions
which use them, thus providing a measure of how often each path is used.

The correspondence between this graph and the final processor is
that each register input and output in this graph requires an input or output
port associated with that register in the final processor. Each output-input
pair in the graph connected by a double-line arc corresponds to a communication
path in the final processor. In order to simplify the physical structure
of the processor, this graph should be reduced as much as possible. To do
this, merges are found whenever possible where the same input operation,
output operation or data transfer used by different instruction cycles can be
merged to occur during the same period of the processor cycle. This merging
reduces the number of register input ports, register output ports and data
communication paths required in the physical structure of the processor.

When  finding  merges,  the  most  important  constraint  arises  from  the 
precedence  relationships.  It  would  be  convenient  to  merge  the  graph  so  that 
each  register  has  only  one  input  and  one  output  port  (which  is  the  minimum 
for  each  register),  but  the  precedence  arcs  prohibit  this  when  the  objective 
is  to  generate  a memory  reference  during  every  cycle.  To  determine  whether 
two  inputs  or  two  outputs  can  be  merged,  one  must  examine  all  precedence 
paths  involving  the  two  input-output  pairs  in  question.  When  considering 
an  input  operation  (to  a destination  register),  the  precedence  arc  is  drawn 
from  the  destination  register  of  the  prior  operation  to  the  source  register 
that  is  paired  with  this  input  operation.  To  determine  whether  two  operations 
can  be  merged,  one  must  trace  the  precedence  paths  to  the  beginning  of  the  cycle. 
If, when tracing a precedence path back from one operation, it is found that it
must follow the other operation, the two operations cannot be merged. If all
pairs  of  precedence  paths  (one  from  each  operation)  are  traced  back  either  to 
the  beginning  of  the  cycle  or  to  a common  operation,  the  two  operations  may  be 
merged.  This  procedure  can  be  followed  with  any  pair  of  operations  that  have  a 
possibility  of  being  merged  until  all  the  possibilities  are  exhausted. 

An example of the merging procedure can be shown using Figure 2.5.
Examine the case of trying to merge outputs of the MA register, numbered 17
and 20 (each operation has been numbered in this figure to aid in explanation).
Tracing back along the precedence arcs from operation 20, it can be seen that
input 18 to the PC register needs to precede number 20. Input 18 is connected
to output 17, however, by a double-line arc indicating that these must be
performed at the same time. Therefore output 17 must precede output 20,
thereby prohibiting outputs 17 and 20 from being merged.

Another  case  is  to  examine  the  possibility  of  merging  inputs  6 and 
15  of  the  MD  register.  Input  6 has  no  precedence  arcs  leading  to  it  but  must 
occur  at  the  same  time  as  output  11  of  the  PC  register.  Output  11  is  preceded 
by  input  1 which  occurs  at  the  beginning  of  the  cycle.  Input  6 is  thus  only 
preceded by 1. Input 15 has no precedence arcs either but must occur with
output 16 of the A register. Output 16 is preceded only by input 1. Therefore
inputs 6 and 15 are both preceded by the same operation and can be
merged.
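
The merge test applied in these two cases can be sketched as a reachability check. The sketch assumes only the arcs named in the text (Figure 2.5 itself is not reproduced here): double-line arcs tie two operations to the same time, and precedence arcs order them. Two operations are taken to be mergeable when neither must follow the other. The names SIMULTANEOUS, PRECEDES and can_merge are illustrative.

```python
# Arcs mentioned in the text for Figure 2.5 (a partial, assumed reading).
SIMULTANEOUS = [(17, 18), (11, 6), (16, 15)]   # double-line arcs: output, input
PRECEDES = [(18, 20), (1, 11), (1, 16)]        # single-line precedence arcs


def can_merge(a, b, simultaneous, precedes):
    """Return True if neither operation is forced to follow the other."""
    group = {}

    def find(x):
        # Representative of the set of operations performed at the same time.
        group.setdefault(x, x)
        while group[x] != x:
            x = group[x]
        return x

    for u, v in simultaneous:
        group[find(u)] = find(v)

    successors = {}
    for u, v in precedes:
        successors.setdefault(find(u), set()).add(find(v))

    def reaches(src, dst):
        # Depth-first search along precedence arcs between time groups.
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(successors.get(node, ()))
        return False

    ga, gb = find(a), find(b)
    if ga == gb:
        return True   # already tied to the same time
    return not reaches(ga, gb) and not reaches(gb, ga)


print(can_merge(17, 20, SIMULTANEOUS, PRECEDES))
print(can_merge(6, 15, SIMULTANEOUS, PRECEDES))
```

Operations joined by a double-line arc are collapsed into one group before the search, which is how the constraint "output 17 occurs with input 18, and input 18 precedes output 20" is carried over to outputs 17 and 20.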

It is also possible to merge an input and an output of the same
register into the same time phase in accordance with the precedence
relationships (assuming registers consist of master-slave or edge-triggered
flip-flops). Several of these merges have been performed (e.g. input 13 and
output 17 of the MA register) for the example in Figure 2.5 and the final set
of operations has been grouped into columns, each of which represents one phase
of the processor cycle. The resultant merged precedence graph is shown in
Figure 2.6.

2.8.  Processor  Layout 

The final form of the merged precedence graph can be used to give
the layout of the processor. The columns of Figure 2.6 have grouped the
register transfer and functional operations into a set of time slots which
must occur in order in the processor. Each of these columns therefore
corresponds to a phase in the pipelined implementation, provided none requires
an excessive amount of time to be performed; a column that does may be divided
into two phases. The double-lined arcs define data transfer paths that are
needed in the final processor. All inputs and outputs of registers are also
defined by Figure 2.6. Figure 2.7 shows the layout of the processor for the
example as obtained from the merged precedence graph of Figure 2.6.

A copy of each register is needed for each phase of the processor
(corresponding to each column of Figure 2.6). The appropriate connection
paths are added during the phase to which they were assigned in Figure 2.6.
Each register is also connected from its predecessor time copy and to its
following time copy. Data transfers are made by selecting the appropriate
input to each register from among the previous time copies of the register and
the various register and functional unit outputs connected to the selector.
The selector is a simple multiplexer with control inputs from the CPU control
logic. Functional operations are selected by the same method as
data transfers. The memory data and address busses are connected to the
assigned phases as in Figure 2.6. The outputs of the registers on the far
right are simply connected to those at the beginning of the cycle at the far
left of Figure 2.7.

Figure 2.7. Final Layout

The operation of this processor is as follows. One
column of Figure 2.7 contains the information for one "process." A process
is a program or portion of a program which can run independently of any other
programs currently running in the processor. During each phase of the processor,
each process is transferred one column to the right in the layout of
Figure 2.7. The process at the last position is transferred to the first
position at the beginning of the pipeline. Each column of the pipeline has a
set of data transfer paths and logic units which can be used only by the process
at that position in the pipeline. Therefore, each process must make use
of each of its resources during the phase in which the process is at the
location of the pipe at which the resource is located.

A multiprocess pipeline has been organized where functional units
and data transfer paths are shared by several processes during one processor
cycle. However, a copy of each register must exist for each process in the
pipeline.

Consider  one  column  of  the  pipeline  and  the  resources  assigned  to  it. 
During  each  phase,  a new  process  is  at  that  position  of  the  pipeline  and  may 
utilize  those  resources.  Therefore  during  one  cycle  of  the  processor,  one 
resource  or  set  of  resources  can  be  used  S times  where  S is  the  number  of 
phases  per  cycle  (or  number  of  positions  in  the  pipeline).  For  S processes, 
only  one  copy  of  each  resource  is  needed  since  the  use  of  each  resource  is 
shared.  The  memory  data  and  memory  address  busses  are  resources  assigned  to 
phases and can be used during each time phase by some process. If each
process executes in a manner which requests one memory reference for each
cycle of the processor, the memory data and address busses will be utilized
every time phase, thereby obtaining maximum usage of these resources. These
are the most costly and most limiting resources, since they require pins, so
high utilization of them produces high throughput for the processor.
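
The sharing of one column's resources among S processes can be illustrated with a small rotation model. The sketch assumes exactly S processes, with process p starting in column p and every process moving one column to the right (with wraparound) at each phase; the function names are illustrative and the model ignores what each process actually computes.

```python
def positions(num_phases, phase):
    """Column occupied by each process at the given phase (process p starts in column p)."""
    return [(p + phase) % num_phases for p in range(num_phases)]


def column_users(num_phases, column):
    """Process found in `column` at each of the phases 0..S-1 of one cycle."""
    users = []
    for phase in range(num_phases):
        occupant = positions(num_phases, phase).index(column)
        users.append(occupant)
    return users


S = 4
for phase in range(S):
    print("phase", phase, "columns of processes 0..3:", positions(S, phase))
print("users of column 0 over one cycle:", column_users(S, 0))
```

At every phase the processes occupy distinct columns, so no two processes contend for the same column's resources, and over one cycle each column (for instance the one to which the memory busses are assigned) is visited once by every process.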

The  structure  described  in  Figure  2.7  is  a highly  regular  structure 
of  registers  which  will  reduce  the  area  needed  for  interconnections  within  the 
chip.  Since  each  of  the  processes  running  in  this  processor  is  independent 
and  each  resource  can  be  utilized  only  when  the  process  is  located  at  its 
position  in  the  pipeline,  there  exist  no  possible  access  conflicts  (collisions) 
[SHA74]  for  the  use  of  any  resource.  This  elimination  of  conflict  resolution 
enables  simplified  control  of  the  pipeline  since  each  process  can  be  controlled 
in  the  same  independent  manner  as  it  would  be  if  it  were  running  on  a single 
microprocessor.  Some  specific  control  structures  which  can  be  applied 
directly  to  the  organization  are  discussed  in  [KOG77]. 

2.9.  Conclusions 

A method of organizing a pipelined processor suitable for implementation
with LSI has been described. It assumes that an instruction set is
chosen which is designed to efficiently perform the intended job load of the
computer. The method presented decomposes these instructions, formulates
which capabilities the processor must include and creates precedence
relationships among the operations of the processor. These operations are then
combined when possible to reduce the complexity of the processor and a final
layout is obtained. Graphical constructs aid in the analysis of alterations.

The pipelined structure does not require complex control structures
because the processes are required to be independent. The regular structure
also reduces interconnection costs and complexities, which is necessary to make
efficient use of chip area in an LSI implementation.




Thus a simple structure is obtainable for a pipelined processor
suitable for LSI implementation. The additional requirements needed for this
processor over a single processor are the additional area for the time
replications of each register and the necessary software to schedule the
independent processes.




CHAPTER 3. PARTITIONING


3.1  The  Partitioning  Problem 

When a system is too large to be implemented on one chip, it must
be partitioned into small enough parts to be implemented with a set of
chips. As discussed in Chapter 1, there are two constraints which limit
what can be implemented on a single chip. These are the total chip area
allowed and the total number of pins allowed per chip. These constraints
apply to partitioning as well as single chip design.

In reality, these constraints are soft constraints, but as they
are exceeded, cost-effectiveness falls sharply. For use in this research,
a hard limit will be assumed. When one of the two constraints is not
met, a partition which reduces the chip area or pin requirements per chip
must be found.

When  it  is  determined  that  a partition  is  required,  the  structure 
of  the  system  to  be  partitioned  must  be  studied  to  find  sets  of  units 
which  have  few  I/O  connection  paths  between  them.  It  is  usually  easy  to 
partition  into  several  units  with  smaller  area,  but  it  is  usually  diffi- 
cult to  achieve  only  a small  number  of  connections  between  units.  With  many 
partitions,  the  number  of  pins  per  chip  will  actually  increase.  Typically, 
the  number  of  pins  required  for  the  partitioned  chip  set  is  significantly 
higher  than  the  number  of  pins  required  if  the  system  were  implemented 
with one chip. Partitions are sought which minimize the number of pins
required. This chapter formulates and studies several general classes of
partitions and compares the effectiveness of the two most useful methods of
partitioning.




3.2  Approaches 

There are several standard ways of partitioning a chip to reduce
the required number of pins [PEN72]. The application of these standard
approaches to a pipelined CPU is studied, as well as an approach specific
to a pipelined CPU.

The first method of partitioning is called a register partition
(or functional partition). The register partition divides the processor
into sets of functionally related registers and functional units. Figure
3.1 shows an example of a register partition of a CPU. Here the processor
is divided into three sections: the control section, the addressing section
and the data section. These sections are determined by finding groups of
registers which have many connection paths between them (see Figure 2.2).
A partition separating these highly connected groups would require many
pins. The register partition places each of these groups of highly
interconnected registers on the same chip, and pins are added only to provide
connection paths between sections.

The second partitioning method is called a bit slice partition.
A bit slice partition places several bits of each of the registers and
functional units onto the same chip according to their relative bit
position in the data word. Figure 3.2 is an example of a bit slice partition
formed by dividing the processor into a high order half and a low order
half. All components processing the low order bits of data words are
placed on the low order chip and likewise for the high order chip. Bit
slice partitions may be made as finely as a one bit wide slice per chip.
This type of partition is good for CPU structures which have few connections
between bit levels but many connections within each bit level.



The third type of partition, called the time slice partition,
places the registers and components which belong to the same phase or group
of phases of the processor cycle on the same chip. An example of this
method is shown in Figure 3.3. This partition is applicable only to the
pipelined processor. It does not require pins for data transfer paths
connecting registers and functional units assigned to the same phase of the
processor cycle.

The time slice partition just described divides the processor
into several sections, each containing a portion of the processor cycle.
This partition requires pins on each chip to communicate the signals which
must be transferred from one phase of the processor to the next. Figure
3.3 demonstrates the connection paths cut for the processor described in
Chapter 2. From this diagram it is seen that for each distinct register on
a chip, two data word width paths are cut.

If each data word is $N_D$ bits wide, $2N_D$ pins are required per chip
per register. This is an exorbitant number of pins; e.g., for a six register
CPU with a 16 bit data word, this partition would require at least 192
pins per chip! Time slicing is therefore discarded as a practical method
of partitioning.
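
The pin count quoted above can be checked with a one line calculation. The following is only a sketch: the function name `time_slice_data_pins` is hypothetical, and it assumes that every register on a chip has exactly two word-wide paths cut (one arriving from the previous phase and one leaving for the next), as read from Figure 3.3.

```python
def time_slice_data_pins(registers_per_chip: int, word_bits: int) -> int:
    """Data pins on one chip of a time slice partition.

    Assumes each register on the chip has two word-wide paths cut:
    one arriving from the previous phase and one leaving for the next.
    """
    paths_cut_per_register = 2
    return paths_cut_per_register * word_bits * registers_per_chip


# Six registers and a 16 bit data word, as in the text.
print(time_slice_data_pins(registers_per_chip=6, word_bits=16))  # 192
```

Control, clock and power pins are not counted here, which is why the text speaks of "at least" 192 pins.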

This  analysis  thus  shows  that  all  phase  replications  of  a register 
should  be  placed  on  the  same  chip.  This  restriction  reduces  the  many  pos- 
sible partitions  and  simplifies  the  analysis  of  the  other  types  of  parti- 
tions. 

A special case of the time slice partition is a process partition,
where each chip contains all hardware necessary for one process. It
basically results in a set of independent parallel processors. However, each
process remains in its chip and does not move from chip to chip to execute
a cycle. This partition is similar to a time slice for each phase with all
control and functional units added to each chip, although some functional
unit sharing within the chips (as in an ALU-bus organization) is possible.
This partition, however, does not use pins efficiently and involves much
duplicated hardware and internal interconnection area. Since this
partition requires much poorly utilized hardware, it will be discarded.

Only  the  bit  slice  and  register  partitions  are  considered  further. 

3.3  Bit  Slice  Analysis 

The  bit  slice  partition  is  a method  used  for  dividing  the  proces- 
sor in  many  commercially  produced  microprocessors.  The  idea  of  the  bit 
slice  is  that  most  data  connection  paths  exist  between  registers  and  func- 
tional units  within  the  same  bit  level  of  each  device.  Connections  between 
bit  levels  are  few  (carries  for  addition  and  link  bits  for  shifts  are  two 
common  examples).  Bit  slice  partitioning  thus  appears  to  be  attractive. 
However,  all  control  signals  must  be  connected  to  each  chip,  since  each  bit 
level  still  requires  all  control  signals.  Control  pins  per  chip  are  there- 
fore not  reduced.  Since  all  control  signals  must  be  connected  to  each  chip 
and  the  control  section  is  not  easily  partitioned  between  bit  levels,  the 
entire  control  section  for  the  bit  sliced  partition  is  generally  implemented 
on  one  chip  (or  more)  separately  from  the  chips  of  the  bit  sliced  processing 
section.  For  each  bit  slice  partition  the  number  of  data  pins  in  the  unpar- 
titioned chip  is  divided  evenly  among  the  partitioned  chips,  link  data  pins 
are  added  and  all  control,  power  and  clock  lines  appear  on  each  chip. 




In  order  to  analyze  the  bit  slice  partition,  a model  of  the  pin 
requirements  is  needed.  Pins  of  the  original  chips  must  be  classified  and 
the  number  of  link  data  pins  determined.  The  classification  scheme  is  de- 
fined below.  The  control  section  itself  is  not  modeled. 

$N_D$ = number of bits in the data word

$N_C$ = total number of control pins from control section to
processor section

$N_{CP}$ = the number of clock and power pins required

$N_{CL}$ = number of carry, link and other sense pins required
between bit levels

$D$ = number of $N_D$ wide I/O ports to the entire CPU

$P_i$ = average number of pins per chip for an i chip partition

$P_T$ = total number of pins in entire system

Using  these  pin  classifications,  the  number  of  pins  required  by 
the  processing  section  if  it  were  implemented  with  one  chip  is 


$P_1 = D N_D + N_C + N_{CP}$.    (3.1)


For a bit slice partition with k chips, assuming k divides $N_D$ and
k > 1, the number of pins per chip becomes

$P_k = \frac{D N_D}{k} + N_C + N_{CP} + N_{CL}$.    (3.2)


The term in this equation which is reduced is the $D N_D / k$ term, i.e., the
data word width per chip is reduced by a factor of k by this partition. The two
terms that are not reduced by this partition are $N_C$ and $N_{CP}$, which
represent control, clock and power. Carry and link pins are added to each chip
when k > 1. $N_C + N_{CP} + N_{CL}$ pins must be included for each partitioned
chip and provide a lower bound on the number of pins per chip for all values
of k.
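
The pin model of Equations 3.1 and 3.2 is easily evaluated numerically. The sketch below is illustrative only; the function names are hypothetical, and the parameter values in the usage lines are those of the 16 bit example processor used in this chapter.

```python
def bit_slice_pins_per_chip(k, n_d, d, n_c, n_cp, n_cl):
    """Pins per chip for a k chip bit slice partition.

    k = 1 corresponds to Eq. 3.1 (no link pins); k > 1 to Eq. 3.2.
    k is assumed to divide the data word width n_d.
    """
    if k < 1 or n_d % k != 0:
        raise ValueError("k must be a positive divisor of the data word width")
    data_pins = d * n_d // k           # data pins shrink by a factor of k
    link_pins = n_cl if k > 1 else 0   # carry/link pins appear once the word is split
    return data_pins + n_c + n_cp + link_pins


def bit_slice_lower_bound(n_c, n_cp, n_cl):
    """Pins that remain on every partitioned chip regardless of k."""
    return n_c + n_cp + n_cl


params = dict(n_d=16, d=2, n_c=12, n_cp=4, n_cl=6)
print([bit_slice_pins_per_chip(k, **params) for k in (1, 2, 4, 8, 16)])
print(bit_slice_lower_bound(12, 4, 6))
```

With these values the per chip counts approach, but never reach, the bound of 22 pins.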

The reduction in the number of pins per chip in going from a one
chip implementation to a two chip bit slice partition is

$\Delta P_{1,2} = \frac{D N_D}{2} - N_{CL}$.    (3.3)

The reduction in the number of pins caused by successively finer
bit slice partitions is

$\Delta P_{k,k+1} = D N_D \left( \frac{1}{k} - \frac{1}{k+1} \right)$,  $k \geq 2$.    (3.4)


This continuous approximation, assuming all k divide $N_D$, provides
some intuition about the bit slice partitioning process. This analysis
shows that in addition to the pin reduction, going from k = 1 to k = 2
incurs a penalty of $N_{CL}$ pins, the number of pins required for carry and
link information. As bit slice partitioning continues with larger values of
k, Eq. 3.4 shows that the pin reduction is lower, since as k increases the
term $(\frac{1}{k} - \frac{1}{k+1})$ decreases asymptotically to zero. The
following is an example of applying the bit slice partition to the data
handling part of the processor used as an example in Chapter 2. The values for
all variables are: $N_D = 16$, $N_C = 12$, $N_{CP} = 4$, $N_{CL} = 6$, $D = 2$. Then:


$P_1 = 2(16) + 12 + 4 = 48$                      $P_T = 48$

$P_2 = 2(\frac{16}{2}) + 12 + 4 + 6 = 38$        $P_T = 76$

$P_4 = 2(\frac{16}{4}) + 12 + 4 + 6 = 30$        $P_T = 120$

$P_8 = 2(\frac{16}{8}) + 12 + 4 + 6 = 26$        $P_T = 208$

$P_{16} = 2(\frac{16}{16}) + 12 + 4 + 6 = 24$    $P_T = 384$.

Considering  the  same  processor  but  with  an  eight  bit  word  width 
instead  of  16  gives: 


$P_1 = 2(8) + 12 + 4 = 32$                     $P_T = 32$

$P_2 = 2(\frac{8}{2}) + 12 + 4 + 6 = 30$       $P_T = 60$

$P_4 = 2(\frac{8}{4}) + 12 + 4 + 6 = 26$       $P_T = 104$

$P_8 = 2(\frac{8}{8}) + 12 + 4 + 6 = 24$       $P_T = 192$

In the first example a reasonable reduction was obtained for the
first two partitions, but by the third partition there was a reduction of
only four pins per chip. The total number of pins for the entire system is
also interesting. It increases drastically as partitioning continues. The
whole system cost increases with the total number of pins, indicating that
partitioning should be avoided whenever possible and, when it is required,
carried only as far as necessary.

In the second example, the data word width is small, so very
little reduction in pins per chip is achieved by any of the partitions.
Again, the total number of pins increases rapidly as in the first
example. In fact, by the third partition there is only a difference of
16 total pins out of 200 between the wide word case and the small word
case. This indicates that many duplications of pins are needed at this high
degree of partitioning.
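
The two examples can be tabulated side by side. The following sketch is illustrative; the function name `bit_slice_rows` is hypothetical, the default parameter values are those of the example processor, and the word width is assumed to be a power of two so that every k divides it.

```python
def bit_slice_rows(n_d, d=2, n_c=12, n_cp=4, n_cl=6):
    """Return (k, pins_per_chip, total_pins) for k = 1, 2, 4, ... up to n_d.

    Assumes n_d is a power of two so that every k divides it.
    """
    rows = []
    k = 1
    while k <= n_d:
        per_chip = d * n_d // k + n_c + n_cp + (n_cl if k > 1 else 0)
        rows.append((k, per_chip, k * per_chip))
        k *= 2
    return rows


wide = bit_slice_rows(16)
narrow = bit_slice_rows(8)
# Print k, then per chip and total pins for each word width, then the gap in totals.
for (k, p16, t16), (_, p8, t8) in zip(wide, narrow):
    print(k, p16, t16, p8, t8, t16 - t8)
```

Under this model the gap in total pins between the two word widths stays at 16 for every k, since the total number of data pins does not depend on k; it is the duplicated control, clock, power and link pins that make the totals grow.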




In summary, the bit slice partition reduces data pins per chip
with each partition, but there is an initial penalty for partitioning and
the reduction in pins per chip shrinks as partitioning increases. The bit
slice partition gives good results for the first several partitions of a
processor with a wide data word but does not do very well when the data
word width is small with respect to the number of control pins. The total
number of pins for the system increases rapidly with increasing
partitioning. Bit slice partitioning, therefore, should only be performed if
absolutely necessary, since the system implementation pin cost is great with a
high degree of partitioning.

3.4  Connection  Graph  Model 

The register slice or functional partition divides the processor
by placing related sets of entire registers and functional units on
separate chips. The pins required for each chip in the register partition are
the power and clock pins, the control pins needed to carry only those
control signals for the units on each particular chip, and the pins that send
the data to and from the set of registers and functional units on each chip.
With this partitioning method, it is not as straightforward to determine
the best partition as with bit slice partitioning. A partition is needed
that separates the processor into sets with a minimal number of
communication paths between them.

A graphical representation of the processor aids in understanding
the connection pattern between registers and functional units and in finding
good register partitions. This graphical representation, called a connection
graph, is described here and will be used later to find register
partitions.

The connection paths in a processor layout are composed of
control signal lines and data path lines. The data path lines, however, consist
of $N_D$ (= number of bits in the data word) lines in parallel. Most
processors have data word widths of at least 8 or 16 bits. The number of
connections required for the data path lines to each register and functional
unit is usually much greater than the number of connections for control lines
to each register and functional unit. The choice of a good register
partition is thus influenced mainly by data lines. Therefore, when
constructing the connection graph, the control lines are omitted. If a more
detailed connection pattern is required, the control lines may be included,
but their omission eases analysis without significantly affecting results.

In  order  to  graph  the  pipelined  processor  of  Chapter  2,  a simpli- 
fication can  be  made  due  to  the  results  of  Section  3.2.  There  it  was  found 
that  any  partition  that  separated  time  replications  of  registers  would  re- 
quire an  excessive  number  of  pins  and  would  thus  be  impractical.  Therefore 
all  phase  replications  of  a register  will  be  placed  on  the  same  chip  and 
for  partitioning  purposes  each  register  need  only  be  modeled  by  a single 
register  copy.  Among  the  register  connection  paths,  only  those  that  con- 
nect different  registers  need  be  modeled  for  partitioning. 

The  processor  can  thus  be  modeled  using  one  vertex  for  each  re- 
gister in  the  processor  and  an  edge  of  weight  one  between  each  set  of  re- 
gisters for  each  data  communication  path  between  them.  Using  this  model 
the  graph  of  the  processor  of  Chapter  2 is  shown  in  Figure  3.4. 
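
A connection graph of this kind can be held as a list of weight one edges, and a candidate partition evaluated by counting the edges whose endpoints fall on different chips. The sketch below is illustrative only: the register names and edges are hypothetical and are not the graph of Figure 3.4, and the function name `cut_paths` is an assumption.

```python
def cut_paths(edges, chip_of):
    """Count data paths whose two registers sit on different chips.

    edges:   list of (register_a, register_b) pairs, one entry per data path,
             each counted with weight one.
    chip_of: dict mapping each register name to a chip number.
    """
    return sum(1 for a, b in edges if chip_of[a] != chip_of[b])


# Hypothetical five register processor (not the graph of Figure 3.4).
edges = [("PC", "MA"), ("MA", "PC"), ("MD", "IR"),
         ("MD", "AC"), ("AC", "MD"), ("IR", "MA")]

addressing_vs_data = {"PC": 0, "MA": 0, "IR": 0, "MD": 1, "AC": 1}
poor_choice = {"PC": 0, "MA": 0, "MD": 0, "IR": 1, "AC": 1}

print(cut_paths(edges, addressing_vs_data))  # 1
print(cut_paths(edges, poor_choice))         # 4
```

Each cut path costs a word-wide set of pins on both chips involved, so the first assignment would be preferred if it also divides the chip area acceptably.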




The  functional  units  are  not  represented  in  this  graph  since  their 
outputs  are  connected  only  to  one  register  and  have  been  grouped  with  that 
output  register.  This  graph  is  basically  a simplification  of  Figure  2.2 
but  includes  any  connection  paths  added  after  the  time  assignment  of  each 
resource  has  been  made.  This  graph  includes  only  the  connection  paths  and 
registers,  which  is  all  the  information  needed  for  register  partitioning. 

After some study of this graph it becomes apparent that it has
some deficiencies. The situation where there is a "Y" in the communication
path is not accurately modeled. A "Y" in the connection path occurs when
an output is connected to two or more inputs. If both input vertices were
to be grouped on one chip and the output vertex on another, the obvious
procedure would be to use one connection from the output vertex to the
second chip containing both input vertices and to divide the connection
path within the second chip. This connection method cannot be obtained
easily by using the graph in Figure 3.4. Therefore a somewhat different
model is needed.

An  alternative  is  to  model  each  communication  path  as  a vertex 
also,  and  use  an  edge  to  connect  each  register  vertex  to  the  communication 
path  vertex  and  another  edge  to  connect  the  communication  path  vertex  to 
the  next  register  vertex.  Thus  a communication  vertex  can  be  assigned  to 
a chip  along  with  one  of  the  two  register  vertices  showing  only  one  I/O 
data  connection  path  into  or  out  of  the  chip.  This  modification  produces 
the  graph  shown  in  Figure  3.5. 
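
The effect of adding communication path vertices can be illustrated on a single "Y". In the sketch below each path vertex is listed with the registers attached to it and is placed on whichever chip leaves the fewest of its edges cut; that placement rule, the function names and the register names are assumptions made for illustration and are not taken from Figure 3.5.

```python
def cut_edges_pairwise(paths, source_of, chip_of):
    """Simple model: one edge from the source register to every other register."""
    total = 0
    for name, regs in paths.items():
        src = source_of[name]
        total += sum(1 for r in regs if r != src and chip_of[r] != chip_of[src])
    return total


def cut_edges_with_path_vertices(paths, chip_of):
    """Refined model: every communication path is itself a vertex.

    The path vertex is assigned to the chip (among those its registers occupy)
    that leaves the fewest register-to-path edges cut.
    """
    total = 0
    for regs in paths.values():
        chips = [chip_of[r] for r in regs]
        total += min(sum(1 for c in chips if c != home) for home in set(chips))
    return total


# Hypothetical "Y": the output of MD feeds both IR and AC.
paths = {"md_out": ["MD", "IR", "AC"]}
source_of = {"md_out": "MD"}
chip_of = {"MD": 0, "IR": 1, "AC": 1}

print(cut_edges_pairwise(paths, source_of, chip_of))   # 2
print(cut_edges_with_path_vertices(paths, chip_of))    # 1
```

The refined model reports a single connection into the second chip, with the split made inside that chip, which is the arrangement described above.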

This modification gives a model which will lead to fewer
communication pins per chip, but in its present form Figure 3.5 is
unnecessarily complex. Where a vertex is used to represent a communication
path which involves data transfer between only two registers, there are only
two edges connected to it. Adding the communication vertex to one chip or the
other (provided the two registers are on different chips) does not change
any connection requirements. Therefore the communication path vertex can
be removed, leaving only one edge between the two corresponding
register vertices. Thus all communication vertices with only two edges
connected to them can be eliminated and the two corresponding edges
replaced with one. Figure 3.6 shows the graph of Figure 3.5 after making
this simplification.
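
The simplification amounts to a single pass over the communication path vertices. The sketch below is illustrative; the function name `simplify` and the path and register names are hypothetical, and the graph is represented simply as a mapping from each path vertex to the registers attached to it.

```python
def simplify(paths):
    """Remove communication path vertices that have exactly two edges.

    paths: dict mapping a path vertex name to the list of registers attached.
    Returns (direct_edges, remaining_paths): a two register path becomes one
    direct edge; paths touching three or more registers (a "Y") are kept.
    """
    direct_edges = []
    remaining_paths = {}
    for name, regs in paths.items():
        if len(regs) == 2:
            direct_edges.append((regs[0], regs[1]))
        else:
            remaining_paths[name] = list(regs)
    return direct_edges, remaining_paths


# Hypothetical graph: two point-to-point paths and one "Y".
paths = {
    "pc_to_ma": ["PC", "MA"],
    "ir_to_ma": ["IR", "MA"],
    "md_out": ["MD", "IR", "AC"],
}
edges, ys = simplify(paths)
print(edges)
print(ys)
```

Only the "Y" vertices survive, so the simplified graph has far fewer vertices while describing the same connection requirements.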

3.5  Register  Slice  Analysis 

The  final  connection  graph  developed  in  Section  3.4  has  an  arc 
for  each  data  path.  By  examining  this  graph  model  one  can  see  ways  of  par- 
titioning which  do  not  cut  too  many  data  paths.  The  graph  in  Figure  3.7 
shows  the  processor  of  Figure  3.6  with  one  possible  partition. 

Dashed lines show the original two I/O buses (MEM ADDRESS and
MEM DATA) and a register partition which cuts only one data path. However,
this partition may not divide the area properly and a different partition
may have to be found.

The connection graph can be used to examine various other
possibilities and to determine how many data paths will be cut in each case.
Good partitions result when few data paths are cut and the implementation
area needed by each component of the partition is reasonable. Furthermore,
good cut sets often divide the system between existing input/output buses.
In the processor example above, since the system has only two existing I/O
buses, any partition that does not divide the chip implementation between
the two buses involves one chip with at least three I/O connection paths,
two for the existing buses and at least one more bus to connect the chip
with some other chip (provided the original connection graph was a
connected graph). Other chips could actually have more data pins than the
original single chip implementation. When a substantial increase in the
number of data pins occurs, it is difficult or impossible to make up the
difference by reducing other types of pins.

Figure 3.7. Register Partition of Example Processor.

Figure 3.8. Partition of a Special Four Bus Chip.

A distinct advantage of the register partition lies in the fact
that partitioning between existing I/O buses can reduce the number of I/O
connection paths per chip significantly. A special example of this is
shown in Figure 3.8, where a four bus chip is partitioned into two three
I/O connection path chips. However, this type of reduction is only
possible in special cases.

In order to analyze the effects of a register partition
numerically, a model similar to the one used for the bit slice partition in
Section 3.3 is used. The following definitions apply:

$N_D$ = number of bits in data word

$N_C$ = number of control signal pins

$N_{CP}$ = number of clock and power pins

$C_{P_n}$ = number of data communication paths cut by a partition
into n chips

$n$ = number of chips in partition

$D$ = number of I/O buses

$P_n$ = average number of pins per chip for an n chip partition

$P_T$ = total number of pins in system

$\Delta P_{n,m}$ = reduction of pins per chip from n to m chip partitions


To  be  consistent  with  the  bit  slice  partition  the  control  section  will  be 
assumed  to  be  separate  from  the  processing  section.  If  the  control  sec- 
tion were  included  with  one  chip  of  the  processing  section,  the  control 
pins  necessary  to  connect  control  signals  to  the  rest  of  the  processing 
section  would  be  added  to  the  chip  with  the  control  section. 

For each path cut in a partition, two additional I/O
communication paths are created, which require $N_D$ pins each in the total
system. This case applies except for dividing paths connected to a "Y", in
which case one must consider the number of chips the "Y" bus is divided
among. For v chips, v I/O connection paths are required. For each normal path
cut, $2N_D$ pins are added to the system. For cutting of a "Y" path, $vN_D$
pins are added to the system. Thus the "effective" number of paths cut is one
in the former case and v/2 in the latter case. In the following, $C_{P_n}$
represents the total effective number of paths cut by an n chip partition. The
number of pins added to the system (among all chips) by cutting of paths is
$2 C_{P_n} N_D$.
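
The bookkeeping for effective paths cut can be sketched as follows. The function names are hypothetical; each cut path is described only by the number v of chips it ends up spanning, with v = 2 for a normal path.

```python
from fractions import Fraction


def effective_paths_cut(chips_spanned_per_cut_path):
    """Total effective number of paths cut.

    Each entry is v, the number of chips a cut path spans: v = 2 for a normal
    path (counted as one path) and v >= 2 for a "Y" (counted as v/2 paths).
    """
    return sum(Fraction(v, 2) for v in chips_spanned_per_cut_path)


def pins_added_by_cuts(chips_spanned_per_cut_path, n_d):
    """Pins added to the whole system: twice the effective cuts times the word width."""
    return int(2 * effective_paths_cut(chips_spanned_per_cut_path) * n_d)


# One normal path cut plus one "Y" divided among three chips, 16 bit word.
cuts = [2, 3]
print(effective_paths_cut(cuts))       # 5/2
print(pins_added_by_cuts(cuts, 16))    # 80
```

The 80 pins are the 32 pins of the normal cut plus the 48 pins of the three-way "Y".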

The equation for the number of pins required in a one chip
implementation, as before, is

$P_1 = D N_D + N_C + N_{CP}$.    (3.5)


If a register partition is performed by dividing into n chips, the average
number of pins per chip is

$P_n = \left( \frac{D + 2 C_{P_n}}{n} \right) N_D + \frac{N_C}{n} + N_{CP}$.    (3.6)


This formula only applies to the average number of pins per chip, since
the original number of I/O buses and number of control pins can only be
divided approximately evenly amongst the chips. Furthermore, it is assumed
that the partition does not interfere with the compact encoding of control
signals, i.e., that the $N_C$ pins can be divided among the chips without any
extra pins being required. By examining Equation 3.6, one can see that the
terms $D N_D$ and $N_C$ are reduced by a factor of n, but there is an average
increase of $(2 C_{P_n} N_D)/n$ pins per chip. The term $(2 C_{P_n} N_D)/n$
represents the fact that each effective data path cut by a partition requires
$2N_D$ pins. This increment is averaged over the n chips. The register
partition is thus most effective when the number of control lines is large
compared to the data word size $N_D$. However, effectiveness also requires low
values of $C_{P_n}$.
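
Equation 3.6 can be evaluated exactly with rational arithmetic, which avoids rounding when n does not divide the pin counts. The sketch below is illustrative; the function name is hypothetical, and the usage values are those of the example processor with one, three and five effective paths cut for two, three and four chips.

```python
from fractions import Fraction


def register_slice_pins_per_chip(n, c_p, n_d, d, n_c, n_cp):
    """Average pins per chip for an n chip register partition (Eq. 3.6).

    c_p is the total effective number of data paths cut by the partition.
    """
    data_pins = Fraction(d + 2 * c_p, n) * n_d   # original buses plus cut paths
    control_pins = Fraction(n_c, n)              # control pins shared out evenly
    return data_pins + control_pins + n_cp       # clock/power on every chip


params = dict(n_d=16, d=2, n_c=12, n_cp=4)
for n, c_p in [(1, 0), (2, 1), (3, 3), (4, 5)]:
    per_chip = register_slice_pins_per_chip(n, c_p, **params)
    print(n, per_chip, n * per_chip)   # chips, average pins per chip, total pins
```

With these cut counts the average rises after the two chip case, which illustrates why low values of the effective cut count are essential.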

Under these assumptions the net reduction of pins per chip when
going from n chips to n+1 chips, $\Delta P_{n,n+1}$, is found as follows:

$\Delta P_{n,n+1} = P_n - P_{n+1}$.    (3.7)


Using Equation 3.6 for n and n+1 and substituting into Equation 3.7 gives

$\Delta P_{n,n+1} = \left( \frac{D + 2 C_{P_n}}{n} - \frac{D + 2 C_{P_{n+1}}}{n+1} \right) N_D + \frac{N_C}{n} - \frac{N_C}{n+1}$.    (3.8)


Combining terms gives

$\Delta P_{n,n+1} = \left( \frac{D + 2(n+1) C_{P_n} - 2n C_{P_{n+1}}}{n(n+1)} \right) N_D + \frac{N_C}{n(n+1)}$.    (3.9)

Let $\Delta C_{P_{n+1}}$ be defined as $C_{P_{n+1}} - C_{P_n}$.


Then

$\Delta P_{n,n+1} = \left( \frac{D + 2 C_{P_n}}{n(n+1)} - \frac{2 \Delta C_{P_{n+1}}}{n+1} \right) N_D + \frac{N_C}{n(n+1)}$.    (3.10)


It can be seen from Equation 3.10 that if $\Delta C_{P_{n+1}}$ is too large,
the actual number of pins will increase instead of decrease, indicating a
poor partition. In order for the number of pins to decrease, $\Delta P_{n,n+1} > 0$.
The range of values of $\Delta C_{P_{n+1}}$ which satisfy this is

$\Delta C_{P_{n+1}} < \frac{D + 2 C_{P_n}}{2n} + \frac{N_C}{2 N_D n}$.    (3.11)

Equation 3.11 must be satisfied in order for the partition to be practical.
If there is no partition for which Equation 3.11 is satisfied, a different
partitioning method should be used.
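
The practicality condition can be tested before a further cut is attempted. The sketch below assumes the bound takes the form obtained by setting Equation 3.10 greater than zero and solving for the additional effective paths cut; the function names are hypothetical.

```python
from fractions import Fraction


def max_additional_cuts(n, c_p_n, n_d, d, n_c):
    """Strict upper bound on the extra effective paths cut when going
    from n to n+1 chips, obtained by requiring Eq. 3.10 to be positive."""
    return Fraction(d + 2 * c_p_n, 2 * n) + Fraction(n_c, 2 * n_d * n)


def cut_reduces_pins(delta_c, n, c_p_n, n_d, d, n_c):
    """True if going from n to n+1 chips lowers the average pins per chip."""
    return delta_c < max_additional_cuts(n, c_p_n, n_d, d, n_c)


params = dict(n_d=16, d=2, n_c=12)
print(max_additional_cuts(1, 0, **params))      # 11/8
print(cut_reduces_pins(1, 1, 0, **params))      # True: first cut with one path
print(cut_reduces_pins(2, 2, 1, **params))      # False: second cut with two more paths
print(cut_reduces_pins(1, 2, 1, **params))      # True: second cut with one more path
```

For the example processor the bound stays only slightly above one, so a further cut is worthwhile only if it adds a single effective path.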

The trends involved may be demonstrated by using the example in
Section 3.3. The values for the parameters are: $N_D = 16$, $D = 2$, $N_C = 12$,
$N_{CP} = 4$. The number of pins for a one chip implementation is

$P_1 = 2 \times 16 + 12 + 4 = 48$        $P_T = 48$

If $\Delta C_{P_2} = 1$ for n = 2 then $C_{P_2} = 1$ and

$P_2 = \left( \frac{2 + 2 \times 1}{2} \right) 16 + \frac{12}{2} + 4 = 42$        $P_T = 84$

If $\Delta C_{P_3} = 2$ for n = 3 then $C_{P_3} = 3$ and

$P_3 = \left( \frac{2 + 2 \times 3}{3} \right) 16 + \frac{12}{3} + 4 = 50\,2/3$        $P_T = 152$

If $\Delta C_{P_4} = 2$ for n = 4 then $C_{P_4} = 5$ and

$P_4 = \left( \frac{2 + 2 \times 5}{4} \right) 16 + \frac{12}{4} + 4 = 55$        $P_T = 220$


For this example, pins decrease on the first cut when $\Delta C_{P_2} = 1$,
but for further cuts, $\Delta C_{P_n} = 2$ and the number of pins per chip
increases with each partition. Thus, for the example used here, the register
partition is not feasible for reducing pins per chip after the first cut.

There is a way of reducing the pins of a four chip partition
below that of either a bit slice or register slice partition. This is
accomplished by doing both one register partition and one bit slice partition.
Each partitioning method is thus used at the stage when it is most
effective. The results of the register slice, bit slice and combination of
both, applied to the example processor used in this research for two cases
with $N_D = 16$ and $N_D = 8$, are listed in Table 3.1.
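
No explicit formula for the combination is given here, so the following sketch rests on an assumption: the register partition is applied first (Equation 3.6), each resulting chip is then bit sliced k ways, and the carry and link pins are assumed to be shared out evenly among the register chips. The function name is hypothetical. This reading has been checked only against the four chip combination entries; it is not claimed to reproduce entries for other chip counts.

```python
from fractions import Fraction


def combination_pins_per_chip(n_reg, c_p, k, n_d, d, n_c, n_cp, n_cl):
    """Average pins per chip for n_reg register chips, each bit sliced k ways.

    Assumption (not from the source): link pins are divided evenly among the
    register chips and appear only when k > 1.
    """
    data_pins = Fraction(d + 2 * c_p, n_reg) * n_d / k   # register cut, then sliced
    control_pins = Fraction(n_c, n_reg)                  # not reduced by bit slicing
    link_pins = Fraction(n_cl, n_reg) if k > 1 else 0
    return data_pins + control_pins + n_cp + link_pins


common = dict(n_reg=2, c_p=1, k=2, d=2, n_c=12, n_cp=4, n_cl=6)
print(combination_pins_per_chip(n_d=16, **common))   # 29
print(combination_pins_per_chip(n_d=8, **common))    # 21
```

Under this assumption the four chip combination needs fewer pins per chip than either a four chip bit slice or a four chip register slice of the same processor.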

In the previous example, the particular system used had $\Delta C_{P_n} = 2$
after n = 2. The $\Delta C_{P_n}$ term then dominates for such a simple system
as n grows. It is useful to examine the trend for systems in which cuts are
possible with $\Delta C_{P_n}$ small. Taking the values of $N_D$, $D$ and $N_C$
as in the previous example but assuming $\Delta C_{P_n} = 1$ for the first
three cuts gives:

$P_1 = 2 \times 16 + 12 + 4 = 48$        $P_T = 48$

$P_2 = \left( \frac{2 + 2 \times 1}{2} \right) 16 + \frac{12}{2} + 4 = 42$        $P_T = 84$

$P_4 = \left( \frac{2 + 2 \times 3}{4} \right) 16 + \frac{12}{4} + 4 = 39$        $P_T = 156$

This gives a continued successive reduction of pins per chip as n grows in
the meaningful range. However, as in the bit slice case, the reduction
per cut decreases and the total number of pins required increases rapidly.
These results, for both $N_D = 16$ and $N_D = 8$, are listed in Table 3.2 with
the results of the bit slice and combination partitioning methods.


Table 3.1. Partitioning Results for Example Processor
(average number of pins per chip; "-" marks a case not evaluated)

             $N_D = 16$, $D = 2$                $N_D = 8$, $D = 2$
Number of    Bit      Reg       Combi-      Bit      Reg       Combi-
Chips        Slice    Slice     nation      Slice    Slice     nation

    1        48       48        -           32       32        -
    2        38       42        -           30       26        -
    3        -        50 2/3    -           -        29 1/3    -
    4        30       55        29          26       31        21
    5        -        57 3/5    -           -        32        -
    6        -        59 1/3    -           -        32 2/3    -


Table 3.2. Summary of Partitioning Results for Various Machines
(average number of pins per chip; Bit = bit slice, Reg = register slice,
Comb = combination; "-" marks a case not evaluated)

No. of   $N_D = 16$, $D = 2$   $N_D = 16$, $D = 4$   $N_D = 8$, $D = 2$    $N_D = 8$, $D = 4$
Chips    Bit   Reg   Comb      Bit   Reg   Comb      Bit   Reg   Comb      Bit   Reg   Comb

  1      48    48    -         80    80    -         32    32    -         48    48    -
  2      38    42    -         54    58    -         30    26    -         38    34    -
  4      30    39    29        38    47    37        26    23    21        30    27    25
  8      26    -     20        30    -     28        24    -     18        26    -     22

In Table 3.2 it can be seen that for the first two systems with a
wide data word ($N_D = 16$, $D = 2$ and $N_D = 16$, $D = 4$), the bit slice
provides a better partition than the register partition. However, the
combination does better when the degree of partitioning is high, namely for
the four and eight chip implementations. For the second two systems with a
small data word ($N_D = 8$, $D = 2$ and $N_D = 8$, $D = 4$), the register slice
performs better than the bit slice partition. This is assuming
$\Delta C_{P_n} = 1$ for each partition. If $\Delta C_{P_n} > 1$, the register
partition would not be practical in this example. Again, the combination of
the two types of partitions performs best with a four or eight chip
implementation.

3.6  Summary 

The  detailed  analysis  of  the  bit  slice  and  register  partitioning 
methods  shows  the  following  results.  In  both  the  register  and  the  bit 
slice  partitioning  methods,  the  effectiveness  of  the  partition  is  reduced 
as  the  degree  of  partitioning  increases.  This  fact  leads  to  using  a combi- 
nation of  the  two  methods  when  a high  degree  of  partitioning  is  required, 
to  take  advantage  of  the  maximum  reduction  of  pins  by  each  method.  Indeed, 
in  the  four  examples  shown,  the  combination  did  outperform  the  others, 
once  the  number  of  chips  was  four  or  more. 

Another fact about both partitioning methods is the rapid increase
in the total number of pins required to implement the system as the number
of chips increases. This indicates that partitioning is very costly and
should be used only when necessary, and then only to the required degree.
The cost of implementing a larger area chip to contain the unpartitioned
system must be very high in order for the system cost to be lower with a
partitioned system.

The  strongest  point  in  favor  of  a bit  slice  partition  is  its  per- 
formance with  wide  data  words.  This  occurs  since  the  bit  slice  divides 
the  processor  between  bits  of  the  data  word,  reducing  the  data  pins  per 
chip,  but  requiring  control  pins  to  be  duplicated  on  all  chips  and  re- 
quiring extra  pins  for  link  and  carry  information. 

The register partition divides the processor into functional sets.
This reduces the number of control pins needed per chip but increases the
number of pins for data paths between chips. Systems with small data words,
but a large number of control lines, perform better with a register
partition provided register partitions with small $C_{P_n}$ are found. Register
partitions can be difficult to find, however, and sometimes a practical one
is impossible to obtain. Analysis of the connection graph to find
partitions with very few cut data paths is required. When it is necessary to
cut more than one or two data paths, register partitioning becomes
unattractive.

In conclusion, partitioning should be avoided if possible, but if
necessary, a bit slice should be used if the data word is large and a
register partition should be used if the number of control lines is large
compared to the size of the data word. When a high degree of partitioning
is required, a combination of both methods should be used. If partitioning
is necessary, but no attractive partitions exist, the machine instruction
set and its implementation may be modified to allow for better partitions.
Improving the register partitions by this means is apt to prove more
satisfactory than trying to improve bit slice partitions.



Thus,  for  simple  processors,  it  has  been  shown  that  the  multiple 
instruction  stream  pipeline  architecture  is  attractive  in  its  use  of  area 
and  pins.  Partitioning  when  necessary  to  reduce  pin  count  or  area  per 
chip  may  be  successful  to  a limited  degree  using  the  bit  slice,  register 
slice  or  combination  methods.  Using  several  separate  simple  single  stream 
processors  is  wasteful  of  both  area  and  pins.  The  next  chapter  considers 
one  comparison  of  a simple  multiple  stream  pipelined  processor  with  a 
more  complex  single  stream  processor. 



CHAPTER 4. EFFECTS OF REGISTER SET SIZE ON PERFORMANCE AND COST

4.1 Performance Tradeoffs between a Single Stream Processor and a Pipelined Processor

The pipelined multiple process CPU is effective in obtaining high utilization of the memory port pins, but to improve performance it must not greatly increase the number of memory references needed per program over that of a single stream multiple data register processor. The additional data registers in the single process CPU reduce the number of memory references required, which decreases the execution time of a process. The decreases in references and time are proportional only if execution time is proportional to the number of memory references required. Otherwise the system is not memory limited and execution time is less sensitive to the number of memory references. For our analysis we assume the worst case for multiple process CPU's: that the systems are memory limited.

The pipelined processor executes several programs concurrently, with each generating one memory reference each processor cycle. It essentially replaces idle time of the memory port between memory references with references from the additional processes. Since the pipelined processor works with multiple processes, a copy of the register set must be contained in the processor for each process in execution. The pipelined processor requires chip area to implement these registers. It is thus important to use a minimal register set to reduce the area needed for these registers to a minimum.

Since each instruction is structured to issue one memory reference each processor cycle, the total execution time, under the memory limited assumption, is approximately equal to the number of memory references multiplied by the processor cycle time. The pipelined processor has a separate process issuing a memory reference each phase of the processor cycle. This increases the number of memory references per unit of time, and hopefully the system performance, over that of a single stream processor. There is, however, a degradation of performance due to the reduced number of data registers in the pipelined processor. This degradation of performance is evaluated by the analysis in this chapter. However, the memory limited assumption favors multiregister processors since it assumes that even they can execute fast enough to issue a memory reference each cycle.

The high use of memory by the pipelined processor will increase the memory conflict problem. However, it has been shown [BRI77], using a pipelined interleaved memory structure, that the performance degradation due to memory conflicts can be made negligible.

The remainder of this chapter describes an analysis of the register usage pattern of some typical programs to determine the number of memory references issued for various register set sizes. The results are used to determine the performance of a pipelined multiprocessor and a single stream multiregister processor. Performance vs cost comparisons are made between these two processors using the memory reference results.

4.2  Performance  Determination 

Processor performance is now developed as a function of the number of data registers, R, available to a single stream and the number of instruction streams, S, concurrently in execution. The system performance, $P_{R,S}$, is expressed as

$$P_{R,S} = P(R)\,S \tag{4.1}$$

for a system with R data registers per stream and S segments of pipelining. P(R) is the performance of a single processor with R registers. To get $P_{R,S}$, P(R) is multiplied by the number of pipeline segments S. Pipelining increases the performance by a factor of S since, instead of one program executing in the typical fashion in the processor, the processor cycle is divided into S segments, each containing a separate process. Each process flows through the pipe in one normal processor cycle time. Therefore system performance is single stream performance multiplied by S. Each process is independent of the results of and the hardware used by any other process in the pipe. No dependencies or hardware resource conflicts occur within the processor and P(R) is not affected by S. This model assumes that memory is designed for minimal access conflict.

In order to evaluate P(R), the execution time of a processor with R registers, denoted T(R), is determined, where

$$T(R) = \frac{1}{P(R)} \tag{4.2}$$

The overall performance of the system can be evaluated with the effective execution time of the program, denoted $T_e(R)$, where

$$T_e(R) = \frac{1}{P_{R,S}} = \frac{T(R)}{S} \tag{4.3}$$




The effective execution time for a pipelined processor is equivalent to the time it would take to execute a program on the pipelined processor if it could be divided into S independent processes of equal execution time and run concurrently on the pipelined processor. Alternatively, it may be viewed as the time to execute S processes on an S-way pipelined processor divided by S.

Since, as described in 4.1, the execution time is proportional to the number of memory references generated, an analysis of the number of memory references required as a function of the register set size is needed. It is difficult to determine the number of memory references saved by increasing the number of registers beyond the number allowed to be referenced in the instruction set. This would require keeping track of all the data in memory. However, it is possible to evaluate the performance degradation of a process, coded to refer to many data registers, as the register set size is reduced. Some work concerning register usage has been done [LUN77], [LUN74], but this work does not give the necessary information to evaluate the degradation of performance as a function of the number of registers available. It does, however, indicate that processor performance would not be degraded significantly by a reduction in register set size. A similar study for a stack machine is described in [BLA77]. Therefore, an analysis of register set sizes smaller than that incorporated in a large machine would be valid for finding an optimal performance/cost tradeoff.

4.3  Program  Trace  Analysis 

In order to determine the degradation of performance as the data register set size is reduced, several program traces from the IBM System 360 were analyzed. The System 360 was used since it has a larger register set size than any microprocessor available as yet and program trace data was readily available. The register reference pattern was studied to determine which register contents would be stored temporarily in memory as the register set size is decreased. This analysis was performed assuming an optimal dynamic mapping of referenced registers to memory and physical processor registers.

For a specific register set size, each time data has to be stored in memory due to the lack of enough processor registers, a register fault occurs. The number of register faults for each register set size is totaled for the entire program trace analyzed. The number of System 360 memory references generated in the program trace is also totaled. Then for each register set size the number of register faults per memory reference is computed. The run time of a process is taken to be the number of cycles required to execute the process, and the number of cycles is estimated as the number of memory references required, following our "memory limited" performance assumption. The number of instructions was not used as the normalizing factor since the number of register faults would vary among instructions, different programs have different instruction utilizations, and the time of execution of various instructions also varies.

Once the number of register faults is known, the degradation of performance is determined by assuming each register fault requires two additional memory references, one to store the data from some processor register to memory and one to load that register from memory with the referenced register's data. Just two cycles are needed since it is assumed that registers are mapped into dedicated memory locations addressable by register tags of 360 programs and no address computation cycles are needed. The register to be overwritten is selected using a variation of Belady's [BEL66] optimal dynamic paging strategy which overwrites (reassigns) that processor register which will be next referenced furthest into the future.

The execution time for a processor with R registers is computed as two times the number of register faults, $F_R$, plus the total number of memory references, $N_{MR}$, all divided by the total number of memory references.

$$T(R) = \frac{2F_R + N_{MR}}{N_{MR}} \tag{4.4}$$


This formula is normalized to the number of memory references required by a processor with all 20 System 360 registers. This provides a measure of the performance degradation relative to a standard processor with a large number of registers.
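As a worked illustration of formula 4.4, the following minimal Python sketch computes the normalized execution time and the related quantities tabulated later in the chapter. The function names are illustrative and do not come from the report; the sample figures (22448 register faults and 75052 memory references for GAUS with one data register) are taken from Table 4.2.

```python
def faults_per_memory_reference(register_faults: int, memory_refs: int) -> float:
    """Register faults normalized to the memory references of the full-register trace."""
    return register_faults / memory_refs


def normalized_execution_time(register_faults: int, memory_refs: int) -> float:
    """Formula 4.4: T(R) = (2 * F_R + N_MR) / N_MR.

    Each register fault is assumed to cost two extra memory references
    (one store and one load), as stated in the text.
    """
    return (2 * register_faults + memory_refs) / memory_refs


def performance_degradation(register_faults: int, memory_refs: int) -> float:
    """Extra execution time relative to the full register set: T(R) - 1."""
    return normalized_execution_time(register_faults, memory_refs) - 1.0


if __name__ == "__main__":
    # GAUS with a single data register (values from Table 4.2).
    f_r, n_mr = 22448, 75052
    print(round(faults_per_memory_reference(f_r, n_mr), 4))
    print(round(normalized_execution_time(f_r, n_mr), 3))
    print(round(performance_degradation(f_r, n_mr), 4))
```

With no register faults the normalized time is exactly 1, which is the reference case of a processor with the full register set.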

This optimal dynamic allocation of data registers will give the fewest number of faults for each register set size. This scheme actually favors the results for large register sets since optimal allocation is more easily reached in practice for small register set sizes. This is especially true for a small processor where the code generation software is not as sophisticated as that of a 360. The instruction set of the 360 is also more powerful and able to use a larger set of registers more optimally. The more sophisticated instruction set also requires fewer fetches, which makes data memory references a more significant portion of the total number of memory references. All these points will cause the results of this analysis to be biased toward favoring larger data register sets than would actually be useful in a small processor.

One problem with the analysis performed is that some of the total of 20 registers are used for addressing purposes. Another analysis of the same traces used here was performed by Hammerstrom [HAM77]. This analysis has shown that on the average only three or four registers are used for addressing, leaving about 16 for data usage. This effect does not cause inaccuracies in the evaluation data for register set sizes less than 15 or 16, which is sufficient for our purposes. Also, four registers of the 360 are dedicated to floating point operations, whereas similar data in an LSI processor as discussed here would be kept in the same register set and be interchangeable with other data. Our analysis has assumed that the floating point registers are interchangeable with fixed point registers, as in the LSI processors. Any inaccuracies due to this effect would be caused by larger register sizes being required in some cases in the 360; multiple registers would in some cases be desirable in an LSI processor if high precision arithmetic is needed.

The analysis data obtained here should thus be reasonably accurate if it is assumed that an LSI processor has at least enough registers of sufficient word width to execute any single one of its basic instructions without register faults, if its registers are allocated properly at the beginning of the instruction. Given this assumption, other assumptions which affect the accuracy of our results cause a consistent bias in favor of large R processors over small R processors.


4.4 Program Analysis Procedure

Program traces were generated for three programs, two doing numerical analysis computations and one doing list processing. The first program, GAUS, performs a Gaussian elimination on a 20 x 20 matrix to solve a set of simultaneous linear equations. GAUS was written in Fortran IV. The second program, EIGEN, is a Fortran numerical analysis program which determines the eigenvalues of a 14 x 14 matrix. This program was taken from the Eigensystem Subroutine Package of the National Activity to Test Software project. The third program, LIST, is a list processing program which first builds list structures, then examines the tree structure of the data and determines the set of all accessible nodes for each sublist head.

After a program is traced, the first step in the analysis is to remove the instructions that deal with addressing since only data register usage is to be studied. A filter program was used with the filter algorithm described in [HAM77]. This provides a trace including only data register usage instructions. Next, the sequence of references to data registers made by the program is determined from the filtered trace. This sequence of register references is used to determine optimal register assignment for each register set size from 1 to 20 using a stack algorithm as in [BEL66]. From the register assignment, the number of register faults required is totaled for each register set size. The number of memory references in the original trace is totaled and used to normalize the number of memory references required by the program for each register set size. The effective execution time for each value of R is computed by formula 4.4. The performance is inversely proportional to $T_e$.




Programs  were  written  to  perform  the  above  tasks.  The  algorithm 
used  to  determine  register  faults  follows. 

1.  Obtain the data register reference sequence from the filtered program trace, recording which register is referenced and whether a read or write reference is made to that register.

2.  Read each register reference from the sequence from step 1. Number them in order starting at the beginning of the sequence. Then reverse the sequence, placing it in a file.

3.  For this step a table is constructed which records, while scanning the reversed sequence, the number of the last reference for each register. Each entry in the table is initially set to infinity. The reversed sequence is read. As each register reference is read, the number of this register reference is placed in the table position corresponding to this register and the reference in the trace is numbered with the overwritten number from the table. The modified reverse trace is stored.

4.  Reverse the sequence from step 3 to form a forward trace. This gives the original register reference sequence, but associated with each register reference is the number of the next reference to this register.

A pseudo-stack is used for steps 5 and 6. The basic operations used are the conventional stack "pop" and an "insert" operation, for which an item may be inserted anywhere in the stack and items below it pushed down. The stack can have up to 20 entries (one for each register in the 360). The top of the stack represents that register, among the set of registers already referenced at least once so far, which will be referenced again soonest in the sequence. At each point in time the "stack" is maintained so that the registers are in sequence in increasing order of the number of their next reference. (The number of each register's next reference was stored with each reference in step 3.) Along with each register in the stack, the greatest depth attained by that particular register since its last reference is recorded with the stack entry.

5.  Read in each reference in sequence from the trace generated at step 4. If the referenced register is in the stack at all, it must be at the top of the stack. In this case, add one register fault to each register set size that is less than the "depth" that this register had reached in the stack since its last reference. Do this only if this reference was a read from the register. (If it were a write, the old data needn't have been kept and a register fault would not occur.) Insert this register in the stack in the appropriate position and set the "depth" of this register to its current position in the stack.

    If the referenced register is not in the stack, then insert it into the stack according to the number of its next reference, set its "depth" to this position in the stack and increment the "depth" of any register below this register which now is in a deeper position in the stack than its current value of "depth."

6.  Repeat step 5 for all references in the sequence from step 4.

This gives the number of register faults for each register set size from 1 to 20.
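Steps 1 through 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the programs actually used for the analysis: the trace is assumed to be a list of (register, is_read) pairs, the function names are hypothetical, and the reverse-and-renumber passes of steps 2 to 4 are collapsed into a single backward scan that yields the same "number of the next reference" tags.

```python
import math
from bisect import bisect_right


def next_reference_numbers(refs):
    """Steps 1-4: tag each reference with the number of the next reference
    to the same register (infinity if it is never referenced again)."""
    upcoming = {}                           # register -> number of its next reference
    tags = [math.inf] * len(refs)
    for i in range(len(refs) - 1, -1, -1):  # scan the sequence in reverse
        reg = refs[i][0]
        tags[i] = upcoming.get(reg, math.inf)
        upcoming[reg] = i
    return tags


def count_register_faults(refs, max_size=20):
    """Steps 5-6: return faults[r] for register set sizes r = 1..max_size."""
    tags = next_reference_numbers(refs)
    faults = [0] * (max_size + 1)           # index 0 is unused
    stack = []                              # entries: [next_ref, register, greatest_depth]
    for i, (reg, is_read) in enumerate(refs):
        pos = next((k for k, e in enumerate(stack) if e[1] == reg), None)
        if pos is not None:
            # The register's next reference is the current one, so it is at the top.
            depth = stack.pop(pos)[2]
            if is_read:
                # Every set size smaller than the greatest depth lost this register.
                for size in range(1, min(depth, max_size + 1)):
                    faults[size] += 1
        # Insert in increasing order of next reference; positions are 1-based.
        new_pos = bisect_right([e[0] for e in stack], tags[i])
        stack.insert(new_pos, [tags[i], reg, new_pos + 1])
        # Registers pushed down may have reached a new greatest depth.
        for k in range(new_pos + 1, len(stack)):
            stack[k][2] = max(stack[k][2], k + 1)
    return faults


if __name__ == "__main__":
    # Hypothetical trace: write A, write B, read A, read B.
    trace = [("A", False), ("B", False), ("A", True), ("B", True)]
    print(count_register_faults(trace, max_size=2)[1:])
```

A single pass over the trace yields the fault counts for every register set size at once, which is the reason for keeping the ordered stack rather than simulating each size separately.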

The results obtained from these operations give a fairly accurate measure of the register usage in the 360. However, there are some instructions that use registers for both data and addressing procedures, and it is difficult to distinguish between these during the filtering operations. The error involved should be small and only if the conclusions reached are sensitive to small variations in the numerical results (which, as it turns out, they are not) would the error be significant.

4.5  Program  Results 

Table 4.1 lists the number of references to each register for the three programs analyzed. As could be expected, certain registers were used much more often than others. This is especially true for the numerical analysis programs, where registers 17 through 20, the floating point registers, are used very extensively. This indicates that an implementation with only the highly used registers implemented would reduce the register set size without much degradation of performance. Some of these 20 registers are used by the 360 for addressing purposes, but many more than the three to four registers normally used for addressing [HAM77] have low usage.

After obtaining a register reference sequence, the stack algorithm is applied to determine how many register faults are generated for register set sizes less than 20. Tables 4.2, 4.3 and 4.4 are a listing for


Table 4.1(a). Register Use Data for "GAUS".

Register No.   No. of Reads   No. of Writes
     1             4997            5471
     2              488             479
     3             1490             448
     4             1488             487
     5             4035            1060
     6             3870             104
     7              272             241
     8             3266            3415
     9              279             245
    10             1179             836
    11              276             320
    12              222             283
    13                0               1
    14                ?               1
    15                0               0
    16                0              15
    17             1738            1757
    18            13153           13155
    19             3848            3242
    20             3080             266
  Total           43686           31846

Table 4.1(b). Register Use Data for "EIGEN".

Register No.   No. of Reads   No. of Writes
     1             2388            2176
     2              371             600
     3              636             103
     4              885             296
     5             2067             432
     6             1954             415
     7              636             546
     8             1701            1859
     9               93              93
    10              345             306
    11              439              25
    12              208             141
    13                0               1
    14                6               1
    15              244             246
    16              246               2
    17             4729            3957
    18            12551           11232
    19             6306            4936
    20             2377            1810
  Total           38182           29179

Table 4.1(c). Register Use Data for "LIST".

Register No.   No. of Reads   No. of Writes
     1             1594            1695
     2             1220            1373
     3              620             826
     4             1707            1637
     5             2595            2312
     6             1037            1096
     7             2183            1604
     8             1763            1141
     9             1173            1265
    10              690             728
    11              181             133
    12              351               4
    13                4               0
    14              652             974
    15             1188            1507
    16              431            1156
    17               21              21
    18               21               7
    19                0               0
    20                0               0
  Total           17431           17679

Table 4.2. Register Fault Data for "GAUS".

Register    Number of         Register Faults   Performance
Set Size    Register Faults   per Mem. Ref.     Degradation
    1           22448             .2991            .5982
    2           20195             .2691            .5382
    3           17956             .2392            .4785
    4           17447             .2325            .4649
    5           14050             .1872            .3744
    6            6849             .0913            .1825
    7            3814             .0508            .1016
    8            1863             .0248            .0496
    9            1224             .0163            .0326
   10            1154             .0154            .0308
   11             662             .0088            .0176
   12             147             .0020            .0039
   13             127             .0017            .0034
   14             110             .0015            .0029
   15              90             .0012            .0024
   16               0             .0000            .0000
   17               0             .0000            .0000
   18               0             .0000            .0000
   19               0             .0000            .0000
   20               0             .0000            .0000

Total number of memory references      75052
Total number of memory data reads      1890?
Total number of memory data writes     7???
Total number of register references    75532

Table 4.3. Register Fault Data for "EIGEN".

Register    Number of         Register Faults   Performance
Set Size    Register Faults   per Mem. Ref.     Degradation
    1           20098             .3188            .6375
    2           14764             .2342            .4683
    3           1164?             .1847            .3694
    4            9854             .1563            .3126
    5            5745             .0911            .1822
    6            3966             .0629            .1258
    7            3095             .0491            .0982
    8            2733             .0433            .0867
    9            2253             .0357            .0715
   10            1784             .0283            .0566
   11             509             .0081            .0161
   12             292             .0046            .0093
   13             163             .0026            .0052
   14             11?             .0018            .0036
   15              60             .0010            .0019
   16              48             .0008            .0015
   17              24             .0004            .0008
   18               1             .0000            .0000
   19               0             .0000            .0000
   20               0             .0000            .0000

Total number of memory references      63048
Total number of memory data reads      16088
Total number of memory data writes     6797
Total number of register references    67361

Table 4.4. Register Fault Data for "LIST".

Register    Number of         Register Faults   Performance
Set Size    Register Faults   per Mem. Ref.     Degradation
    1            9622             .1380            .2760
    2            7370             .1057            .2114
    3            5981             .0858            .1716
    4            4841             .0694            .1389
    5            3884             .0557            .1114
    6            3549             .0509            .1018
    7            2985             .0428            .0856
    8            259?             .0364            .0727
    9            1896             .0272            .0544
   10            1475             .0212            .0423
   11            1115             .0160            .0320
   12             593             .0085            .0170
   13             280             .0040            .0080
   14              43             .0006            .0012
   15              13             .0002            .0004
   16              13             .0002            .0004
   17               0             .0000            .0000
   18               0             .0000            .0000
   19               0             .0000            .0000
   20               0             .0000            .0000

Total number of memory references      69723
Total number of memory data reads      11640
Total number of memory data writes     8451
Total number of register references    35110

GAUS, EIGEN and LIST respectively, of the total number of register faults generated, the number of register faults per memory reference and the performance degradation for register set sizes from 1 to 20.

The interesting figure here is the low number of register faults per memory reference even with only one data register, the highest being only slightly more than .3, with a performance degradation of .64. The two numerical analysis programs, GAUS and EIGEN, have fairly similar distributions, but the file manipulation program, LIST, has the fewest register faults until the register set size reaches seven.

Figure 4.1 is a graph of the data from Tables 4.2 through 4.4. This graph shows the trend more clearly. As the register set increases, the register faults decrease in a manner roughly approximating an exponential decay. All three programs have a similar number of register faults when the register set is greater than seven. Both numerical analysis programs have a jog in the graph for register set sizes from four to six. The file manipulation program is rather smooth.

Since three to four registers on the average are used for addressing, no more than 17 registers are usually available for data. This is why the number of register faults is very close to zero for register set sizes of about 15 to 16.


4.6  Evaluation  of  Performance 

Figure 4.1 gives effective execution time as a function of the number of data registers for a single process CPU. The worst degradation of performance is when R is one. Figure 4.1 indicates that T(1) is about 1.6 for the two numerical analysis programs (EIGEN and GAUS) and about 1.3 for the file manipulation program (LIST). Thus very little degradation in performance results from reducing the number of data registers to an absolute minimum. As registers are added, performance does not increase much for each added register, and the law of diminishing returns can be seen to take effect as the curve levels out for more than seven or eight data registers. At this point performance is only degraded by about 10 percent. Equation 4.3, repeated here for reference,

$$T_e(R) = \frac{T(R)}{S} \tag{4.3}$$

relates the effective execution time of a pipelined machine with the above data. The effective execution time for several cases with values of R from one to 16 and S from one to eight has been evaluated and tabulated in Tables 4.5, 4.6 and 4.7 for GAUS, EIGEN and LIST respectively. The dramatic decrease in $T_e(R)$ with pipelining is evident. In all three programs, the single register two segment pipeline outperforms the 16 register single segment processor. As pipelining continues, large decreases in execution time occur, although the improvement per segment added becomes less. This can be seen when comparing several cases from Tables 4.5 through 4.7.

Examine Table 4.5 for example. Here, taking the R = 1 case, the decrease of execution time from S = 1 to S = 2 is 50 percent. The decrease from S = 2 to S = 3 is only 33 percent and so on. This, however, is significantly more than the decrease in execution time when examining the effect of increasing R. For the S = 1 case, the decrease of execution time from R = 1 all the way to R = 16 is only 37 percent. Further, 15 registers must be added to the processor to obtain this improvement. This data shows that


Table 4.5. Effective Execution Time for Various Processors for the Program "GAUS"

                                   S
  R       1       2      3      4      5      6      7      8
  1     1.598    .799   .533   .400   .320   .266   .228   .200
  2     1.538    .769   .513   .385   .308   .256   .220   .192
  3     1.478    .739   .493   .370   .296   .246   .211   .185
  4     1.465    .733   .488   .366   .293   .244   .209   .183
  5     1.374    .687   .458   .344   .275   .229   .196   .172
  6     1.183    .592   .394   .296   .237   .197   .169   .148
  7     1.1016   .551   .367   .275   .220   .184   .157   .138
  8     1.0497   .525   .350   .262   .210   .175   .150   .131
  9     1.0326   .516   .344   .258   .207   .172   .148   .129
 10     1.0308   .515   .344   .258   .206   .172   .147   .129
 11     1.0176   .509   .339   .254   .204   .170   .145   .127
 12     1.0039   .502   .335   .251   .201   .167   .143   .125
 13     1.0034   .502   .334   .251   .201   .167   .143   .125
 14     1.0029   .501   .334   .251   .201   .167   .143   .125
 15     1.0024   .501   .334   .251   .200   .167   .143   .125
 16     1.0000   .500   .333   .250   .200   .167   .143   .125

Table 4.6. Effective Execution Time for Various Processors for the Program "EIGEN"

                                  S
  R       1      2      3      4      5      6      7      8
  1     1.638   .819   .546   .410   .328   .273   .234   .205
  2     1.468   .734   .489   .367   .294   .245   .210   .184
  3     1.369   .685   .456   .342   .274   .228   .196   .171
  4     1.313   .657   .438   .328   .263   .219   .188   .164
  5     1.182   .591   .394   .296   .236   .197   .169   .148
  6     1.126   .563   .375   .282   .225   .188   .161   .141
  7     1.098   .549   .366   .275   .220   .183   .157   .137
  8     1.087   .544   .362   .272   .217   .181   .155   .136
  9     1.071   .536   .357   .268   .214   .179   .153   .134
 10     1.057   .529   .352   .264   .211   .176   .151   .132
 11     1.016   .508   .339   .254   .203   .169   .145   .127
 12     1.009   .505   .336   .252   .202   .168   .144   .126
 13     1.005   .503   .335   .251   .201   .168   .144   .126
 14     1.004   .502   .335   .251   .201   .167   .143   .126
 15     1.002   .501   .334   .251   .200   .167   .143   .125
 16     1.002   .501   .334   .251   .200   .167   .143   .125



adding registers to decrease execution time with a memory limited system, as described here, is not very effective. Furthermore, if a system were not memory limited, or if one of our other assumptions did not hold, pipelining would be preferred over added registers to an even greater extent.

Examining  the  tables  for  the  other  program  traces  gives  similar 
results.  Adding  registers  for  the  case  of  the  LIST  program  is  even  less 
effective  than  for  the  other  two  programs.  This  is  probably  due  to  less 
locality  of  data  in  this  program  and  the  use  of  fewer  temporary  results. 

The tradeoff between adding registers and pipelining can be observed
by examining graphs of constant performance as a function of R and S.
Figures 4.2, 4.3 and 4.4 are plots of constant performance for the
programs GAUS, EIGEN and LIST respectively, which are computed from data
in Tables 4.5 through 4.7. These curves represent "equiperformance"
profiles and are labeled with their relative performance with respect to a
single stream, 16 data register processor. The x's are plotted for integer
values of S and interpolated for R to obtain smooth curves which indicate
the lower bound for values of R and S of systems for each level of
performance plotted.

As one follows a curve of constant performance from the one register
case toward increasing R, the curve "flattens" rapidly toward an
asymptote  equal  to  S = 1/T^(16).  However,  even  when  R is  small,  the  slope 
is not great and the increase necessary in S to maintain constant
performance is small.
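
A constant performance curve of this kind can be traced numerically. The
sketch below is only an illustration: it assumes that the effective
execution time of an S segment processor is the single stream time divided
by S, which is the relationship the entries of Table 4.6 appear to follow
(for example 1.638/2 = .819). The names T1_EIGEN and segments_needed are
hypothetical and do not come from the report.

```python
# Single stream execution times T_1(R) for EIGEN, R = 1..16,
# copied from the S = 1 column of Table 4.6.
T1_EIGEN = [1.638, 1.468, 1.369, 1.313, 1.182, 1.126, 1.098, 1.087,
            1.071, 1.057, 1.016, 1.009, 1.005, 1.004, 1.002, 1.002]

def segments_needed(target_time, t1=T1_EIGEN):
    """Return {R: S} where S = T_1(R) / target_time.

    Assumption: effective execution time scales as T_1(R) / S, so the
    (possibly fractional) S that holds performance constant follows directly.
    """
    return {r: t / target_time for r, t in enumerate(t1, start=1)}

# Points on the curve for an effective execution time of 0.25.
curve = segments_needed(0.25)
for r in (1, 2, 4, 8, 16):
    print(f"R = {r:2d}  S = {curve[r]:.2f}")
```

Under this assumption the required S falls from about 6.6 at R = 1 to about
4.0 at R = 16, and most of the drop occurs over the first few registers.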

The flatter the curve, the less effective the addition of registers
is. As the effective execution time decreases (performance increases)
the curves become steeper for low values of R. This indicates that when
higher degrees of pipelining are used, register addition is more effective
than it is in a low degree pipe. Of course the cost of adding
registers in this case is also greater since each added register must have
a copy for each stream. If these plots were extended for higher degree
pipelines, addition of registers when R is small would become more
effective. However, pipelining to a very high degree may not be practical.

Figure 4.2. Constant Performance Plot for "GAUS" (axis: Pipeline Segments)

Figure 4.3. Constant Performance Plot for "EIGEN" (axis: Pipeline Segments)

Figure 4.4. Constant Performance Plot for "LIST"

For  each  particular  case  the  performance  must  be  evaluated  as 
well  as  the  cost  of  the  alternative  ways  of  reaching  that  performance. 

The  effects  of  register  addition  and  pipelining  on  performance  have  been 
shown.  Their  effect  on  cost  must  also  be  considered. 

4.7  Cost-Effectiveness 

The cost of the LSI processors as modeled here is composed
primarily of two parameters, number of pins and chip area. The pin problem
has been discussed in Chapter 3. However, the number of pins required by
a processor as S and R vary remains constant and the nature of bit slice
and register partitioning is unchanged, assuming that the data registers
are treated as a block. The difference in cost between the pipelined
processor and the multiple register processor is chip area, which is now
investigated.

The chip area can be divided into five classifications according
to its use in implementation. These are control area, functional unit area,
register area, interconnection area and pin driver area, denoted A_C, A_F,
A_R, A_I and A_D respectively. Quantitatively the area cost can be expressed
as

     A_total = A_C + A_F + A_R + A_I + A_D.                    (4.5)


These area classifications can be grouped into two groups, those
that are constant for the range of processors discussed in the chapter and
those which are dependent on degree of pipelining and/or number of data
registers. Since the pin requirements over the range of processors do
not change, A_D is constant with all implementations studied. All processors
execute the same instruction set and control is approximately the same other
than that needed to control the additional functional units in the
pipelined case and additional registers in the multiple register case. Control
can be grouped as part constant and part incremental dependent on the values
of R and S. Functional unit area has a minimum component representing a
minimal set of logic units and a component to represent the additional
units needed when pipelining. The register area can be represented with
four components, a component for registers other than data registers, a
component for data registers and two components representing copies of
both data and nondata registers needed for pipelining. The interconnection
area is needed for all control and data signals and can be attributed to
the functional units and registers that require these signals. The
interconnection cost is therefore not treated separately, but rather absorbed
into the cost of appropriate functional units and registers.

The total area cost thus can be expressed approximately as

     C = k_0 + k_1 R + k_2 (S-1) + k_3 R(S-1)                  (4.6)

where k_0 represents the constant cost of driver area, constant portion of
control and functional units, the nondata register cost and the
interconnection area associated with them. k_1 represents the area cost of each
data register for the nonpipelined case and the associated control and
interconnection area for these registers. k_2 represents the additional
area cost required for pipelining the nondata registers and the additional
logic unit area needed as pipelining is increased. k_3 represents the
additional area for registers when pipelining the data registers. Both k_2 and
k_3 depend on (S-1) instead of S since these are marginal costs beyond those
of the basic single process CPU.
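
As an illustration of how the four coefficients enter the area cost, the
following sketch evaluates the cost expression for three processors. The
coefficient values are purely hypothetical (arbitrary area units) and are
chosen only to show which terms are exercised by adding registers and which
by adding pipeline segments; area_cost is not a name used in the report.

```python
def area_cost(r, s, k0, k1, k2, k3):
    """Area cost C = k0 + k1*R + k2*(S-1) + k3*R*(S-1) (Equation 4.6)."""
    return k0 + k1 * r + k2 * (s - 1) + k3 * r * (s - 1)

# Hypothetical coefficients, not taken from the report.
K = dict(k0=100.0, k1=2.0, k2=10.0, k3=2.0)

base = area_cost(1, 1, **K)        # basic single process, one data register
more_regs = area_cost(16, 1, **K)  # 15 added data registers, no pipelining
two_seg = area_cost(1, 2, **K)     # one added pipeline segment

print(base, more_regs, two_seg)
```

With S = 1 the k2 and k3 terms vanish, so the only marginal cost is k1 per
added data register; with S > 1 every data register is charged an extra k3
for each additional segment.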

The major cost differences over the range of processors are
additional functional unit cost for the pipelined case, the additional register
cost for either the multiple register or pipelined case and the
interconnection cost associated with these units. The multiple register processor
does not require any additional functional units over those of the basic
CPU except for multiplexers and demultiplexers needed by the additional
registers. Most of the nonconstant cost for this case represents additional
register area and interconnection area. This is what k_1 represents.

The pipelined processor adds area for replication of each register,
data and nondata, needed for each segment of the pipe, and for the
additional functional units required when one is used more than once during
a cycle, provided the segmentation of the pipe divides the processor cycle
between separate uses of the functional unit during a cycle. There is some
area saved, however, with pipelining. When an additional functional unit
is added, fewer interconnections are needed since the units can be placed
closer to the sections of the processor requiring their use. The other
saving of area is that the interconnection area needed per register to
connect registers in the pipelined fashion is likely to be less than the
interconnection area required per register to extend the basic register set.
This is because the pipelined structure of layout is very regular and
regular patterns tend to require less interconnection area than less
well-structured patterns do. Since, in the pipelined processor, data is always
shifted to the next register, dynamic registers can be used throughout.
These require less area than do static registers.

The relative costs of the various uses of area vary widely with
the specifics of the CPU to be implemented and the technology used. With
additional information, values for k_0, k_1, k_2 and k_3 can be assigned and
cost tradeoffs evaluated.

To obtain a crude approximation of the relative costs of
alternative processors, certain simplifying assumptions can be made for the
constants in Equation 4.6. Assuming that each register will cost the same
amount of area regardless of its use as a new register or a replication
required for pipelining an old register, coefficient k_3 will be set equal to
k_1. Secondly, it is assumed that the cost of the additional functional
units for pipelining is balanced approximately by the savings in area for
interconnection to these units and the reduced interconnection area per
register required when pipelining. Utilizing this and the first assumption,
the constant k_2 will be set equal to the number of nondata registers times
k_1. For the processor of Figure 2.5, k_2 = 5k_1.

Equation 4.6 can be rearranged as

     C = k_0 - k_2 + (k_1 - k_3)R + k_2 S + k_3 RS.            (4.7)

Invoking the previous assumptions and substituting k_3 = k_1 and k_2 = 5k_1
gives

     C = k_0 - 5k_1 + k_1(5+R)S.                               (4.8)

Normalizing Equation 4.8 to the constant term k_0 - 5k_1 and letting
C_{R,S} = C/(k_0 - 5k_1) and K' = k_1/(k_0 - 5k_1) gives

     C_{R,S} = 1 + K'(5+R)S.                                   (4.9)

Using this equation, the approximate costs for various processors with
performance given in Tables 4.5 through 4.7 can be compared.
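
The algebra leading from the general cost expression to the normalized form
can be checked numerically. The sketch below is a sanity check only: the
values chosen for k0 and k1 are hypothetical, and the function names are
introduced here for illustration.

```python
import math

def cost_eq_4_6(r, s, k0, k1, k2, k3):
    """General area cost, Equation 4.6."""
    return k0 + k1 * r + k2 * (s - 1) + k3 * r * (s - 1)

def cost_eq_4_8(r, s, k0, k1):
    """Simplified cost, Equation 4.8, assuming k3 = k1 and k2 = 5*k1."""
    return k0 - 5 * k1 + k1 * (5 + r) * s

def normalized_cost(r, s, k_prime):
    """Normalized cost, Equation 4.9."""
    return 1 + k_prime * (5 + r) * s

k0, k1 = 100.0, 2.0                 # hypothetical coefficients
k_prime = k1 / (k0 - 5 * k1)        # K' as defined for Equation 4.9

# Compare the three forms over the whole range of R and S in the tables.
consistent = all(
    math.isclose(cost_eq_4_6(r, s, k0, k1, 5 * k1, k1),
                 cost_eq_4_8(r, s, k0, k1))
    and math.isclose(cost_eq_4_8(r, s, k0, k1) / (k0 - 5 * k1),
                     normalized_cost(r, s, k_prime))
    for r in range(1, 17) for s in range(1, 9)
)
print(consistent)
```

Because the normalized cost depends on R and S only through the product
(5+R)S, comparing two processors reduces to comparing that product.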

It is interesting to compare the cost of the single process,
many register CPU (R=16, S=1) to that of the single register, two segment
CPU (R=1, S=2). For the program GAUS, the effective execution time for
R = 1, S = 2 is about .8, which is better than the single process CPU for
R = 16, for which T_1(16) = 1.0. The cost for the single process CPU
(R=16, S=1) is

     C_{16,1} = 1 + K'(5+16)1 = 1 + 21K'.                      (4.10)

For the two segment pipelined CPU (R=1, S=2)

     C_{1,2} = 1 + K'(5+1)2 = 1 + 12K'.                        (4.11)

Thus, under the assumptions made about costs, the best single process CPU
costs significantly more, by 9K' (provided K' is significant compared to 1),
than the worst performing pipelined CPU, and that pipelined CPU performs
25 percent better. Thus the R,S = 1,2 processor is estimated to be both
cheaper and faster than the R,S = 16,1 processor.

Several other cases, comparing the costs of single data register
processors with various levels of pipelining and multiple register
processors with less pipelining but equal performance are summarized in
Table 4.8 for the program GAUS. In each case, for equal performance, the
higher level pipeline was cheaper. Thus, based on the cost assumptions,
it is better to increase the level of pipelining than to add data registers.
Comparisons using the programs "EIGEN" and "LIST" are summarized in Tables
4.9 and 4.10 respectively. Similar results are attained for "EIGEN" and
even more convincing results for "LIST."
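
A comparison of this kind can be reproduced from the tabulated execution
times. The sketch below is an illustration under two assumptions: effective
execution time is taken as T_1(R)/S, as the entries of Table 4.6 suggest,
and cost is ranked by the product (5+R)S that multiplies K' in Equation 4.9.
The function name and the target value are hypothetical.

```python
# Single stream execution times T_1(R) for EIGEN, R = 1..16 (Table 4.6, S = 1).
T1_EIGEN = [1.638, 1.468, 1.369, 1.313, 1.182, 1.126, 1.098, 1.087,
            1.071, 1.057, 1.016, 1.009, 1.005, 1.004, 1.002, 1.002]

def equal_performance_options(target_time, t1, max_segments=8):
    """For each S, find the smallest R whose time T_1(R)/S meets the target.

    Returns (cost coefficient, R, S) tuples sorted by cost coefficient, where
    the coefficient is (5 + R) * S from the normalized cost 1 + K'(5+R)S.
    """
    options = []
    for s in range(1, max_segments + 1):
        for r, t in enumerate(t1, start=1):
            if t / s <= target_time:
                options.append(((5 + r) * s, r, s))
                break          # T_1(R) does not increase with R
    return sorted(options)

options = equal_performance_options(0.184, T1_EIGEN)
for coeff, r, s in options:
    print(f"R = {r}, S = {s}: cost 1 + {coeff}K'")
```

For this target the cheapest choice is the most deeply pipelined one, R = 2
with S = 8, at 1 + 56K', while six segments need seven registers at 1 + 72K'.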

These cost results are only crude approximations, but if actual
parameters could be obtained for k_0, k_1, k_2 and k_3, results could be
obtained for actual processors in a given technology. It is hard to imagine
that more precise modeling and actual parameters for a given technology would
alter the conclusions drawn here. The intent here has been only to
approximate the nature of cost and performance dependency of pipelining vs the use
of additional data registers.

4.8  Conclusion 

It has been shown in this chapter that performance is enhanced
more by pipelining than by adding a significant number of data registers.
In fact, it was found that, when extending the data register set of a
single process CPU even to 16 data registers, it performed no better than
80 percent of the performance of a single data register, two segment
pipelined CPU.

These performance evaluations were based on the data register
reference pattern within the IBM 360 for three program traces. Although
the 360's performance is much different than that of a small processor,
these differences by and large detract from the value of pipelining relative
to using more registers for a single stream processor. The conclusions
reached would be even more strongly justified by more realistic data.

Table 4.8. Costs for Processors of Equal Performance
for the Program "GAUS"

Table 4.9. Costs for Processors of Equal Performance
for the Program "EIGEN"

     Performance    R    S    Cost
                    7    6    1 + 78K'
     5.4            2    8    1 + 56K'

Table 4.10. Costs for Processors of Equal Performance
for the Program "LIST"

     Performance    R    S    Cost
                    1    6    1 + 36K'
                    7    6    1 + 72K'
                    1    7    1 + 42K'

The number of memory references was used as the basis for
performance for a single processor as register set size varied. The various
performance levels as a function of the level of pipelining and the
register set size were evaluated from the memory reference data and summarized
in Tables 4.5 through 4.7. An approximate cost evaluation was performed
on this data which supported the previous conclusions on grounds of cost
as well as performance. More specific performance/cost data is dependent
on the specific machine characteristics and capabilities and on the
technology used, but for the data gathered here, pipelining is an efficient
and practical architecture for an LSI processor.



CHAPTER  5.  CONCLUSION 

The  purpose  of  this  research  is  to  determine  future  architectures 
suitable  for  implementation  with  LSI,  taking  into  consideration  the  trend 
in  growth  of  practical  LSI  chip  capacity.  Since  the  area  of  the  chip  has 
been  and  is  expected  to  continue  increasing  exponentially,  it  was  determined 
that  pin  limitations  of  future  LSI  implementations  would  be  a major 
constraint  on  feasible  implementations.  In  order  to  obtain  high  pin 
utilization,  which  reduces  the  number  of  pins  required  on  the  chip,  as 
well  as  high  performance,  a multiprocess  pipeline  was  chosen  as  being  most 
suited  to  achieve  these  objectives.  It  needed  to  be  shown,  however,  how 
such  a processor  could  be  organized.  A methodology  to  determine  the 
organization  and  final  layout  of  a multiprocess  pipelined  CPU  suitable  for 
LSI  implementation  was  developed.  This  methodology  is  based  on  a one 
memory  port  processor,  which  is  necessary  to  minimize  the  number  of  pins 
required.  The  methodology  begins  with  a given  instruction  set,  designed 
for  efficient  execution  of  the  intended  program  mix. 

The  objective  of  the  organization  is  to  generate  memory  references 
as  soon  as  possible  during  the  processor  cycle  while  minimizing  the  number 
of  internal  resources.  By  using  the  time  between  memory  references  of  one 
process,  to  issue  memory  references  from  other  processes,  high  throughput 
can  be  attained  with  only  one  memory  port.  This  organization  is  particularly 
suited  for  LSI  implementation  because  of  its  regular  structure  which 
reduces  the  amount  of  interconnection  area  required  to  allow  the  addition  of 
active  devices. 


It  is  also  desirable  to  know  what  effects  would  occur  if  a processor 
had  to  be  partitioned  into  more  than  one  chip  and  how  this  could  effectively 
be  done.  Two  reasonable  methods  of  partitioning  were  found,  the  bit  slice 
and  register  partition.  These  were  analyzed  to  determine  how  the  pin 
count  was  affected  by  partitioning.  It  was  found  that  the  bit  slice 
partition is useful to partition a processor with a large data word,
compared to the number of control lines. The register partition performs
better  when  the  number  of  control  lines  necessary  is  large  compared  to  the 
data  word  size  and  when  cut  sets  exist  which  divide  the  I/O  buses  and 
chip  area  appropriately  without  cutting  too  many  internal  data  paths.  A 
combination  of  the  two  partitioning  methods  is  best  when  a high  degree  of 
partitioning  is  required.  In  all  cases,  however,  the  number  of  pins  for 
the  chip  set  becomes  very  high  as  partitioning  is  increased.  It  was  also 
found  that  the  effectiveness  of  each  partitioning  method  in  reducing  the 
number  of  pins  per  chip  decreases  as  partitioning  increases. 

The performance of the pipelined multiprocessor was compared with
that of a single process CPU with multiple data registers. A multiple
data register processor is one approach some manufacturers have used to
utilize additional area to improve performance without increasing pin
requirements. The pipelined multiprocessor uses few data registers to allow
area for adding registers required for pipelining. To determine the
effectiveness of adding registers, the register usage pattern
of a processor with many data registers, the IBM System 360, was studied.
Traces of several programs were analyzed to determine the degradation in
performance as the register set was reduced. This data was then used to
determine the performance of various single data register pipelined
processors and several single process CPU's with multiple data registers.
It was found that pipelining was substantially more effective in increasing
the performance than adding data registers. The costs of these two types of
processors were modeled and an approximate cost comparison made for
processors with the same performance. It was found that the pipelined
processor was more cost-effective. These results show that the pipelined
multiprocessor is a good organization for future LSI designs.

There are several other aspects of this processor which could be
investigated further. The data register effectiveness has been analyzed
but it would be interesting to analyze the effect of the number and type
of addressing registers on performance. The program counter register is
probably necessary but the effectiveness of other addressing registers
such as the stack pointer register, indexing registers and base registers
needs to be investigated.

Also of importance to the pipelined multiprocessor is the
selection of an instruction set. Sequences of basic instructions could
also be analyzed to find optimal instruction sets for pipelined processors.
The layout of this processor is dependent specifically on the type of
instructions chosen to be executed by the processor. Careful selection
could improve the performance and retain a desirable organization for the
processor. It may also be determined that certain job requirements often
encountered (e.g., execution of parallel real time tasks) can be effectively
handled with a pipelined multiprocessor.




The method of communicating between processes and scheduling of
processes for a pipelined processor should also be studied. Certain
environments, such as time sharing, lend themselves very well toward a pipelined
multiprocessor, but system utilization in different environments needs to
be analyzed. There are operating systems which control multiple processes
now and their adaptation to this pipelined multiprocessor should be
straightforward.




BIBLIOGRAPHY


BEL66  Belady, L. A., "A Study of Replacement Algorithms for a Virtual
       Storage Computer," IBM Sys. J., Vol. 5, 1966, 78-101.

BLA77  Blake, R. P., "Exploring a Stack Architecture," Computer, Vol. 10,
       No. 5, May 1977, 30-38.

BRI77  Briggs, F. A., "Memory Organization and Their Effectiveness for
       Multiprocessing Computers," CSL Report R-768, University of
       Illinois, May 1977.

DEC77  DEC (Digital Equipment Corporation), Processor Handbook PDP11/45,
       1974.

FOR62  Ford, L. R. and Fulkerson, D. R., Flows in Networks, Princeton
       University Press, 1962.

HAM77  Hammerstrom, D. W., "Analysis of Memory Addressing Architecture,"
       CSL Report R-777, University of Illinois, July 1977.

HP     HP (Hewlett-Packard Co.), A Pocket Guide to HP Computers (HP 2114A,
       2115A and 2116B Computers).

KET76  Ketelsen, M. L., "An Integrated Circuit Fault Model for Digital
       Systems," CSL Report R-743, University of Illinois, September
       1976.

KOG77  Kogge, P. M., "The Microprogramming of Pipelined Processors,"
       Proceedings of the Fourth Annual Symposium on Computer
       Architecture, April 1977, 63-69.

LUN77  Lunde, A., "Empirical Evaluation of Some Features of Instruction
       Set Processor Architectures," Communications of the ACM, Vol. 20,
       No. 3, March 1977, 143-153.

LUN74  Lunde, A., "Evaluation of Instruction Set Processor Architecture
       by Program Tracing," Ph.D. Thesis, Carnegie-Mellon U., Pittsburgh,
       PA, July 1974.

MAY72  Mayeda, W., Graph Theory, John Wiley and Sons, Inc., 1972, 420-443.

NOY76  Noyce, R. N., "From Relays to MPU's," Computer, Vol. 9, No. 12,
       Dec. 1976, 26-29.

PEN72  Penney, W. M. and Lau, L., ed., MOS Integrated Circuits, Van
       Nostrand Reinhold Co., 1972.

SHA74  Shar, L. E. and Davidson, E. S., "A Multiminiprocessor System
       Implemented Through Pipelining," Computer, Feb. 1974, 42-51.




APPENDIX A. INSTRUCTION PRECEDENCE GRAPHS

The  following  is  a list  of  the  instruction  precedence  graphs  for 
each  instruction  of  the  instruction  set  listed  in  Table  2.7.  Table  A.l 
describes  each  of  the  functional  components  used  in  the  precedence  graphs. 



Table A.1. Functional Components

     A1     MD -> A
     A2     MD (ADD or SUB or IOR)
     A3     A SHIFT -> A
     A4     TEST A
     A5     A -> A
     A6     A + 1 -> A
     A7     A (±) 1 -> A
     B1     MD -> MA
     B2     PC -> MA
     B3     MD + MA -> MA
     B4     SP -> MA
     D1
     D2     PC -> MD
     D3     A -> MD
     I
     M1     addr
     M2
     P1     PC + 1 -> PC
     P2     MA -> PC
     S1     SP + 1 -> SP
     S2     MA -> SP

Figure A.2. Instruction Precedence Graphs II.
(Panels: JUMP INDIRECT; JMP SUB; JMP SUB INDIRECT)


APPENDIX  B.  USING  GRAPHS  TO  FIND  REGISTER  PARTITIONS 

In order to perform a register partition on a processor, the
interconnection structure of the processor must be studied to find the best
places to divide the processor. This appendix describes a heuristic
using a graphical representation of the processor to find desirable
partitions. This problem may appear to resemble the well-known minimal
cut problem, but differs since a cut between two particular vertices is
not required. Assignment of vertices is dependent on the interconnection
pattern between vertices and the amount of area required by each vertex
such that a near equal division of area is obtained. The following
procedure can be used to obtain reasonable partitions.

Figure B.1 is a graph derived in Chapter 3 for the computer used
as an example in this thesis. The first step is to determine the connection
and terminal capacity matrices for this graph. Each vertex in this graph
is identified both by the register or communication path it represents
and a unique number which will be used to identify it in the matrices (the
communication path vertices are only identified with a number). The
connection matrix is determined easily by inspection and is shown in Equation
B.1 for the example.

The connection matrix defines all connections between vertices.

            1  2  3  4  5  6  7  8  9 10 11
       1    d  1  0  0  0  0  0  0  0  0  0
       2    1  d  1  1  0  0  0  0  0  0  0
       3    0  1  d  0  0  0  0  0  0  0  0
       4    0  1  0  d  2  1  1  0  0  0  0
       5    0  0  0  2  d  0  0  0  0  0  0
 C =   6    0  0  0  1  0  d  1  0  0  0  1        (B.1)
       7    0  0  0  1  0  1  d  1  1  1  0
       8    0  0  0  0  0  0  1  d  0  0  0
       9    0  0  0  0  0  0  1  0  d  1  0
      10    0  0  0  0  0  0  1  0  1  d  1
      11    0  0  0  0  0  1  0  0  0  1  d

In order to find partitions the terminal capacity matrix [MAY72] is first
used. This matrix specifies, for any graph, the maximum flow between pairs
of vertices which is equal to the minimum number of connections to be cut
if these vertices are divided [FOR62]. The terminal capacity matrix for
the example is shown in Equation B.2.

            1  2  3  4  5  6  7  8  9 10 11
       1    d  1  1  1  1  1  1  1  1  1  1
       2    1  d  1  1  1  1  1  1  1  1  1
       3    1  1  d  1  1  1  1  1  1  1  1
       4    1  1  1  d  2  2  2  1  2  2  2
       5    1  1  1  2  d  2  2  1  2  2  2
 T =   6    1  1  1  2  2  d  3  1  2  3  2        (B.2)
       7    1  1  1  2  2  3  d  1  2  3  2
       8    1  1  1  1  1  1  1  d  1  1  1
       9    1  1  1  2  2  2  2  1  d  2  2
      10    1  1  1  2  2  3  3  1  2  d  2
      11    1  1  1  2  2  2  2  1  2  2  d


The terminal capacity matrix indicates vertex pairs which have
a large number of connection paths. (The number of connection paths is
equal to the maximum number of signals that could start at one vertex and
reach the other using disjoint paths in the graph.) These connection paths
are useful for determining groups which are highly interconnected and should
be placed on the same chip.
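
The entries of a terminal capacity matrix can be computed mechanically as
maximum flows. The sketch below is one way to do so and is not the method
used in the thesis: it applies a standard breadth-first augmenting path
(Edmonds-Karp) maximum flow to every vertex pair. The edge list EDGES is the
connection matrix of the example as far as it can be read from Equation B.1
and should be treated as an assumption; the function names are illustrative.

```python
from collections import deque

# Connections of the example graph (vertex pair -> number of connections),
# as read from Equation B.1; treat as an assumption.
EDGES = {(1, 2): 1, (2, 3): 1, (2, 4): 1, (4, 5): 2, (4, 6): 1, (4, 7): 1,
         (6, 7): 1, (6, 11): 1, (7, 8): 1, (7, 9): 1, (7, 10): 1,
         (9, 10): 1, (10, 11): 1}

def max_flow(edges, source, sink):
    """Maximum flow between two vertices of an undirected graph."""
    cap = {}
    for (u, v), c in edges.items():
        cap.setdefault(u, {})[v] = c   # an undirected connection can carry
        cap.setdefault(v, {})[u] = c   # up to c signals in either direction
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow                # no augmenting path is left
        # Walk back from the sink to collect the path, then augment along it.
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
        flow += bottleneck

def terminal_capacity(edges, n):
    """Terminal capacity t[i, j] for every pair of distinct vertices 1..n."""
    return {(i, j): max_flow(edges, i, j)
            for i in range(1, n + 1) for j in range(1, n + 1) if i != j}

t = terminal_capacity(EDGES, 11)
print(t[6, 7], t[4, 5], t[1, 2])
```

For the example this gives 3 connection paths between vertices 6 and 7, 2
between vertices 4 and 5, and 1 between vertices 1 and 2.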

Step two groups vertices in the matrix with a large number of
connection paths between them and places those vertices which are defined to
be external connections. This grouping is performed by interchanging pairs
of rows and columns in the matrix. The i-th group is denoted as g_i. First
the outside connection vertices are grouped together. In the example,
vertices 1 and 8, representing the memory data and address buses, are
assigned to g_1 which simply represents necessary external buses. Next,
the largest of the elements of the terminal capacity matrix (t_{i,j}) are
identified. Vertices i and j are assigned to a group g_2. Other
vertices, k, where t_{k,l} is the same as t_{i,j} where l ∈ g_2, are also
assigned to group g_2. Next, for each t_{m,n} = t_{i,j}, where neither m nor
n are assigned to any group, m and n are assigned to a new group. Then, for
any other vertices, k, where t_{k,l} = t_{i,j}, for l ∈ g_i and k ∉ g_j for
any j, vertex k is added to group g_i. The vertices of a group are placed
next to each other in the terminal capacity matrix. Any vertex k with the
next highest values of t_{k,l} where l is assigned to a group are placed
close to the group in the reordered terminal capacity matrix.
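
The grouping step can be sketched in a few lines. The code below is a
simplified illustration, not the full procedure: it forms only one group
from the largest terminal capacity value and then ranks the remaining
vertices by their strongest tie to that group. The matrix T is the terminal
capacity matrix of the example as read from Equation B.2 (diagonal entries
written as 0), and the function name reorder is hypothetical.

```python
# Terminal capacity matrix of the example (Equation B.2), vertices 1..11.
T = [
    [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 2, 2, 2, 1, 2, 2, 2],
    [1, 1, 1, 2, 0, 2, 2, 1, 2, 2, 2],
    [1, 1, 1, 2, 2, 0, 3, 1, 2, 3, 2],
    [1, 1, 1, 2, 2, 3, 0, 1, 2, 3, 2],
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 2, 2, 2, 2, 1, 0, 2, 2],
    [1, 1, 1, 2, 2, 3, 3, 1, 2, 0, 2],
    [1, 1, 1, 2, 2, 2, 2, 1, 2, 2, 0],
]

def reorder(t, external):
    """Return (group, order): the most tightly connected group and a vertex
    order with external vertices first and the group last."""
    n = len(t)
    inner = [v for v in range(1, n + 1) if v not in external]
    best = max(t[i - 1][j - 1] for i in inner for j in inner if i != j)
    # Every vertex that takes part in a pair with the largest capacity.
    group = sorted({i for i in inner for j in inner
                    if i != j and t[i - 1][j - 1] == best})
    rest = [v for v in inner if v not in group]
    # Strongest tie of each remaining vertex to the group.
    tie = {k: max(t[k - 1][l - 1] for l in group) for k in rest}
    next_best = max(tie.values())
    near = [k for k in rest if tie[k] == next_best]   # placed close to group
    far = [k for k in rest if tie[k] != next_best]
    return group, list(external) + far + near + group

group, order = reorder(T, external=[1, 8])
print(group, order)
```

For the example the group found is (6, 7, 10), and the resulting order, 1 8
2 3 4 5 9 11 6 7 10, places vertices 4, 5, 9 and 11 next to it.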

Equation B.3 is the reordered terminal capacity matrix for the
example used here.

           (0)  (0)  (0) (1½)  (1)  (1)  (1)  (1)  (0)  (1)  (0)
             1    8    2    3    4    5    9   11    6    7   10
       1     d    1    1    1    1    1    1    1    1    1    1
       8     1    d    1    1    1    1    1    1    1    1    1
       2     1    1    d    1    1    1    1    1    1    1    1
       3     1    1    1    d    1    1    1    1    1    1    1
       4     1    1    1    1    d    2    2    2    2    2    2
 T =   5     1    1    1    1    2    d    2    2    2    2    2     (B.3)
       9     1    1    1    1    2    2    d    2    2    2    2
      11     1    1    1    1    2    2    2    d    2    2    2
       6     1    1    1    1    2    2    2    2    d    3    3
       7     1    1    1    1    2    2    2    2    3    d    3
      10     1    1    1    1    2    2    2    2    3    3    d

Added in this equation are weights for each vertex representing
the relative amount of chip area required by each vertex. In this example
node 3 is given a weight of 1-1/2 since it represents the instruction decode
circuitry as well as the IR register. The vertex weights are indicated
in parentheses above the column vertex label. These weights (w_i = weight
of vertex i) are used to determine whether the amount of area required for
any particular group is less than the maximum (W_MAX) able to be implemented
with one chip. A maximum connection limit must also not be
exceeded by each group which fits onto one chip. These limits will be used
to find the final partition using the following procedure. The values of
these limits for the example will be W_MAX = 3 and a connection limit of 3.

Step 3 involves forming the connection matrix for the reordered
terminal capacity matrix. Step 2 found groups with many connection paths,
but the number of actual connections is the main criterion for the final
partition and the connection matrix must be used for this. The reordered
connection matrix for the example used here is in Equation B.4.

           (0)  (0)  (0) (1½)  (1)  (1)  (1)  (1)  (0)  (1)  (0)
             1    8    2    3    4    5    9   11    6    7   10
       1     d    0    1    0    0    0    0    0    0    0    0
       8     0    d    0    0    0    0    0    0    0    1    0
       2     1    0    d    1    1    0    0    0    0    0    0
       3     0    0    1    d    0    0    0    0    0    0    0
       4     0    0    1    0    d    2    0    0    1    1    0
 C =   5     0    0    0    0    2    d    0    0    0    0    0     (B.4)
       9     0    0    0    0    0    0    d    0    0    1    1
      11     0    0    0    0    0    0    0    d    1    0    1
       6     0    0    0    0    1    0    0    1    d    1    0
       7     0    1    0    0    1    0    1    0    1    d    1
      10     0    0    0    0    0    0    1    1    0    1    d


The two groups formed in step two are g_1 = (1,8) and g_2 = (6,7,10).
Each of the remaining vertices will be assigned to a group of one vertex.
The groups assigned to each of the remaining vertices are as follows:

     g_3 = (2)
     g_4 = (3)
     g_5 = (4)
     g_6 = (5)
     g_7 = (9)
     g_8 = (11)

Merges of groups are now found to form final partitions, denoted P_i, which
reduce the connections as much as possible.

Step 4 is the first merging step. First some definitions are
given that will be used in the merging procedure.

A target group is a group with weight greater than zero with which other
groups are merged to reduce the number of connections or to fill the merged
group chip area. The other group is called the merged group.

A partitioned unit, denoted P_i, is a group that is finished being
merged.

R_{i,j} denotes the reduction of connections obtained by merging
a group g_i with the target group, g_j.

M_{i,j} represents a merging index which indicates a gain obtained
by the merge; it is equal to R_{i,j} divided by the total weight of all
vertices of the merged group, g_i.
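
These two quantities can be illustrated on the example graph. The sketch
below rests on an assumption about the definition: the reduction is read as
the number of connections the merged group has with the target group minus
the number it has with all other vertices, which is how Step 5 of the
procedure appears to describe it. EDGES and WEIGHT are the connections and
vertex weights of the example as read from Equations B.1 and B.3; all
function names are hypothetical.

```python
# Connections (vertex pair -> count) and vertex weights of the example.
EDGES = {(1, 2): 1, (2, 3): 1, (2, 4): 1, (4, 5): 2, (4, 6): 1, (4, 7): 1,
         (6, 7): 1, (6, 11): 1, (7, 8): 1, (7, 9): 1, (7, 10): 1,
         (9, 10): 1, (10, 11): 1}
WEIGHT = {1: 0, 2: 0, 3: 1.5, 4: 1, 5: 1, 6: 0, 7: 1, 8: 0, 9: 1, 10: 0, 11: 1}

def connections(a, b):
    """Total number of connections between vertex sets a and b."""
    return sum(c for (u, v), c in EDGES.items()
               if (u in a and v in b) or (u in b and v in a))

def reduction(merged, target):
    """Assumed reading of R: connections to the target minus connections
    from the merged group to everything else."""
    others = set(WEIGHT) - set(merged) - set(target)
    return connections(merged, target) - connections(merged, others)

def merging_index(merged, target):
    """M = R divided by the total weight of the merged group.
    Not defined for a merged group of zero weight (e.g. vertex 2)."""
    return reduction(merged, target) / sum(WEIGHT[v] for v in merged)

target = {6, 7, 10}
for merged in ({9}, {11}, {4}):
    print(sorted(merged), reduction(merged, target),
          merging_index(merged, target))
```

Under this reading, merging vertex 9 or vertex 11 into the group (6, 7, 10)
removes two connections each, while merging vertex 4 would add one.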


The  first  part  of  Step  4 assigns  any  completed  group  as  a partitioned 


115 


In  the  example  consisting  of  the  two  external  vertices,  cannot  be  merged 

with  any  groups  and  is  complete.  Therefore  group  g,  is  assigned  to  P^. 

Merges of groups are now found. The group with the largest
number of outside connections and W > 0 is chosen as the target group.
The number of outside connections is equal to the sum of all t_{i,j}, where
j is a vertex in the group and i is a vertex not in the group. In the
example, group g_2 is the first target group.
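
The outside-connection count defined above can be illustrated with a minimal sketch. It assumes the connection matrix is held as a nested list `t` with 0-based vertex numbers; the function name and the four-vertex matrix are hypothetical and are not the matrix of Equation B.4.

```python
def outside_connections(t, group):
    """Sum t[i][j] over vertices j inside the group and vertices i outside it."""
    members = set(group)
    return sum(
        t[i][j]
        for j in members
        for i in range(len(t))
        if i not in members
    )


# Hypothetical symmetric connection matrix for four vertices (illustration only).
t_example = [
    [0, 2, 1, 0],
    [2, 0, 0, 1],
    [1, 0, 0, 3],
    [0, 1, 3, 0],
]

# Connections leaving the group made of vertices 0 and 1.
print(outside_connections(t_example, [0, 1]))
```

Connections between two vertices of the same group are not counted, so a group containing every vertex has no outside connections.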

Then the group which does not add any external connections to
the target group and has the largest index M_{i,j} with the target group is
merged with it. This step is repeated for all groups as target groups.
The order in which the target groups are chosen does not matter, since if a
group has connections only to one group when it is merged, there are no
other groups with which it could be merged in this step. This merge is
only carried out if W_MAX is not exceeded by performing the merge. This
step never adds any connections to any group. It only reduces them.
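
The selection rule of this step might be sketched as follows. This is an illustration under stated assumptions rather than the procedure as implemented for the report: groups are kept in a dictionary of vertex lists, `w_max` stands for the weight limit, a zero-weight group is given an infinite merging index, and all names and data are hypothetical.

```python
def links(t, a, b):
    """Number of connections between vertex lists a and b."""
    return sum(t[i][j] for i in a for j in b)


def step4_candidate(t, weights, groups, target, w_max):
    """Pick the group to merge into `target` under the Step 4 rule, or None."""

    def weight(name):
        return sum(weights[v] for v in groups[name])

    best, best_index = None, None
    for name, members in groups.items():
        if name == target:
            continue
        others = [v for n, g in groups.items()
                  if n not in (name, target) for v in g]
        to_target = links(t, members, groups[target])
        # Skip groups that touch anything besides the target: merging them
        # would bring new external connections into the target group.
        if to_target == 0 or links(t, members, others) != 0:
            continue
        if weight(target) + weight(name) > w_max:
            continue
        w = weight(name)
        # Assumption: a zero-weight group is given an infinite merging index.
        index = to_target / w if w > 0 else float("inf")
        if best_index is None or index > best_index:
            best, best_index = name, index
    return best


# Hypothetical data: only g2 is connected to g1 and to nothing else.
t_example = [
    [0, 2, 1, 0],
    [2, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]
weights_example = [1, 1, 1, 1]
groups_example = {"g1": [0], "g2": [1], "g3": [2], "g4": [3]}

print(step4_candidate(t_example, weights_example, groups_example, "g1", w_max=3))
```

Because a qualifying group has all of its connections on the target, its reduction equals its connection count with the target, which is why the index is computed from `to_target` alone.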

In the example, groups g_7 and g_8 are merged with group g_2, which
forms a group with 3 external connections and a weight of 3. This group
is completed since W_MAX has been reached and no further merges are
possible. It is now denoted as P_2. Group g_6 is merged with group g_5,
which forms a group g_5 with weight 2 and 3 external connections. The only
groups left are g_5 = (4,5), g_3 = (2) and g_4 = (3). No further merges of
the type described in this step are possible.

Step 5 consists of finding merges which may add new connections
to the group, but reduce the net total of connections to the group. Each
group g_j has a set of connections with the target group, g_i, and a set of
connections with outside groups. If the number of connections g_j has with
the target group is greater than the number g_j has with outside groups,
then the total reduction of connections is the difference between the
sizes of these two sets of connections, which is equal to R_{i,j}.

If the value of R_{i,j} is positive for a merged group with a target group,
then there are no other target groups with which this group will have a
positive value of R_{i,j}. This is true since if there were a second target
group, g_k, for which R_{k,j} is positive, then the merged group would have
more connections with this second target group g_k than with all other
groups. In this case R_{i,j} must be negative. Therefore, there can be only
one target group with which a merged group could have a positive value of
R_{i,j}.
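
A minimal sketch of the reduction and the merging index follows, assuming that the reduction is the number of connections the merged group has with the target minus the number it has with all other groups, as described above. The function names, the dictionary layout and the data are hypothetical. The loop at the end illustrates the uniqueness argument on one small case.

```python
def links(t, a, b):
    """Number of connections between vertex lists a and b."""
    return sum(t[i][j] for i in a for j in b)


def reduction(t, groups, target, merged):
    """R: connections of `merged` to `target` minus its connections elsewhere."""
    rest = [v for n, g in groups.items()
            if n not in (target, merged) for v in g]
    return links(t, groups[merged], groups[target]) - links(t, groups[merged], rest)


def merging_index(t, weights, groups, target, merged):
    """M: R divided by the total weight of the merged group (assumed > 0 here)."""
    total_weight = sum(weights[v] for v in groups[merged])
    return reduction(t, groups, target, merged) / total_weight


# Hypothetical data (illustration only).
t_example = [
    [0, 3, 2, 0],
    [3, 0, 0, 1],
    [2, 0, 0, 1],
    [0, 1, 1, 0],
]
weights_example = [1, 1, 2, 1]
groups_example = {"A": [0], "B": [1], "C": [2], "D": [3]}

# For each merged group, list the targets that give a positive reduction.
for merged in groups_example:
    positive = [tg for tg in groups_example
                if tg != merged
                and reduction(t_example, groups_example, tg, merged) > 0]
    print(merged, positive)
```

In this data no merged group has more than one target with a positive reduction, which is what the argument above predicts: a positive value requires more connections to one target than to all other groups combined.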

The group with the largest number of connections and W > 0 is
chosen as the first target group, and merged groups are found which have a
positive value of R_{i,j}. Of these groups, the group with the largest value
of M_{i,j} with this target group is merged first, if W_MAX is not exceeded.
If W_MAX is exceeded, the group with the next highest value is tried. This
is repeated until either the target group is filled or no further merged
groups are left. After trying all possibilities of merged groups, the next
target group is chosen by taking the group with the next largest number of
connections and W > 0.
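
The greedy loop for one target group can be sketched as below. This is an illustration only: it assumes ranking by the merging index, rescoring after every merge, and an infinite index for zero-weight groups, none of which is confirmed by the text. The names and data are hypothetical, and the `groups` dictionary is modified in place.

```python
def links(t, a, b):
    """Number of connections between vertex lists a and b."""
    return sum(t[i][j] for i in a for j in b)


def step5_fill(t, weights, groups, target, w_max):
    """Greedily merge positive-R groups into `target`; return their names in order."""

    def weight(members):
        return sum(weights[v] for v in members)

    merged_names = []
    while True:
        scored = []
        for name, members in groups.items():
            if name == target:
                continue
            rest = [v for n, g in groups.items()
                    if n not in (name, target) for v in g]
            r = links(t, members, groups[target]) - links(t, members, rest)
            if r > 0:
                w = weight(members)
                # Assumption: zero-weight groups get an infinite merging index.
                scored.append((r / w if w > 0 else float("inf"), name))
        scored.sort(reverse=True)  # largest merging index first
        for _, name in scored:
            if weight(groups[target]) + weight(groups[name]) <= w_max:
                groups[target] = groups[target] + groups.pop(name)
                merged_names.append(name)
                break  # indices change after a merge, so rescore
        else:
            return merged_names  # nothing fit, or no positive-R group left


# Hypothetical data (illustration only).
t_example = [
    [0, 3, 2, 0],
    [3, 0, 0, 1],
    [2, 0, 0, 1],
    [0, 1, 1, 0],
]
weights_example = [1, 1, 2, 1]
groups_example = {"A": [0], "B": [1], "C": [2], "D": [3]}

print(step5_fill(t_example, weights_example, groups_example, "A", w_max=3))
```

A full Step 5 would call this once per target group, taking targets in decreasing order of their number of connections.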

In the example, after Step 4 the connection matrix is as shown
in Equation B.5.

    Weight:  (0)  (0)  (0)  (1½)  (1)  (1) | (1)  (1)  (0)  (1)  (0)

    [Matrix entries not legible in this copy]                    (B.5)

The only groups left are g_3, g_4 and g_5. Both groups g_3 and g_5 have
3 external connections. Group g_5 has no groups which will reduce its
number of external connections. Group g_4 reduces the number of
connections to group g_3, but g_3 has weight 0 and cannot be used as a
target group. Therefore g_3 and g_4 cannot be merged. Considering the
relative weights, group g_4 has a weight of 1½ and 1 connection, and group
g_5 has a weight of 2 and 3 connections. These two groups cannot be merged
since W_MAX is exceeded, and therefore g_4 is assigned to P_3 and g_5 to
P_4. Group g_3 has weight zero and does not have to be assigned. The
partition for this example is complete. The steps for the completion of
the partition for the general case follow.

Step 6 merges any groups g_j with a target group g_i, where R_{i,j} is 0.
The group g_j which has the smallest weight is merged first if W_MAX is not
exceeded. This is repeated until the group is filled or no further groups
with zero value of R_{i,j} are left. If W_MAX is exceeded by a merge, then
no further groups are tried in this step. This step is repeated for the
rest of the groups as target groups.
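
Step 6 for a single target group might look like the following sketch. It assumes the reduction is recomputed after each merge and that `w_max` is the weight limit; the names and data are hypothetical, and the `groups` dictionary is modified in place.

```python
def links(t, a, b):
    """Number of connections between vertex lists a and b."""
    return sum(t[i][j] for i in a for j in b)


def step6_fill(t, weights, groups, target, w_max):
    """Merge zero-reduction groups into `target`, lightest first."""

    def weight(members):
        return sum(weights[v] for v in members)

    merged_names = []
    while True:
        zero_r = []
        for name, members in groups.items():
            if name == target:
                continue
            rest = [v for n, g in groups.items()
                    if n not in (name, target) for v in g]
            r = links(t, members, groups[target]) - links(t, members, rest)
            # Note: a group with no connections at all also has r == 0;
            # this sketch does not exclude it.
            if r == 0:
                zero_r.append((weight(members), name))
        if not zero_r:
            return merged_names
        w, name = min(zero_r)  # smallest weight first
        if weight(groups[target]) + w > w_max:
            return merged_names  # limit exceeded: stop trying for this target
        groups[target] = groups[target] + groups.pop(name)
        merged_names.append(name)


# Hypothetical chain A - B - C with one connection per link.
t_example = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
weights_example = [1, 1, 1]
groups_example = {"A": [0], "B": [1], "C": [2]}

print(step6_fill(t_example, weights_example, groups_example, "A", w_max=2))
```

Stopping at the first merge that does not fit follows the text: since the lightest zero-reduction group is tried first, any remaining one is at least as heavy.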

Step 7 merges groups with weight > 0 which will increase the number
of connections to the target group, but with the greatest values of
M_{i,j}. A target group, g_i, is chosen as the group with the greatest
number of connections. Then groups, g_j, with W > 0 are examined and the
one with the highest value of M_{i,j} is merged first, provided W_MAX and
the connection limit are not exceeded. If a limit is exceeded, the group
with the next highest value of M_{i,j} is tried. If the merge is
successful, the merged group is returned to the set of groups and a new
target group is chosen.

This step is repeated for each target group, with new target groups
chosen in decreasing order of the number of connections they have with
other groups. If at any time W_MAX or the connection limit is met, the
group is assigned to a partition unit and removed from the set of groups.
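
One pass of Step 7 for a single target can be sketched as follows. The sketch assumes two limits, a weight limit `w_max` and a connection limit `pin_limit` (both names are hypothetical), and ranks candidates by the merging index without checking the sign of the reduction. The data are invented for illustration.

```python
def links(t, a, b):
    """Number of connections between vertex lists a and b."""
    return sum(t[i][j] for i in a for j in b)


def outside(t, members):
    """Connections between the given vertices and every other vertex."""
    inside = set(members)
    return sum(t[i][j] for j in inside for i in range(len(t)) if i not in inside)


def step7_merge_once(t, weights, groups, target, w_max, pin_limit):
    """Merge the best-ranked group that keeps `target` within both limits.

    Returns the merged group's name, or None if no candidate fits.
    """

    def weight(members):
        return sum(weights[v] for v in members)

    ranked = []
    for name, members in groups.items():
        w = weight(members)
        if name == target or w <= 0:
            continue  # only groups with positive weight are examined
        rest = [v for n, g in groups.items()
                if n not in (name, target) for v in g]
        r = links(t, members, groups[target]) - links(t, members, rest)
        ranked.append((r / w, name))
    ranked.sort(reverse=True)  # least costly merge per unit of weight first
    for _, name in ranked:
        union = groups[target] + groups[name]
        if weight(union) <= w_max and outside(t, union) <= pin_limit:
            del groups[name]
            groups[target] = union
            return name
    return None


# Hypothetical data in which every merge into A adds connections.
t_example = [
    [0, 1, 1, 0],
    [1, 0, 2, 0],
    [1, 2, 0, 3],
    [0, 0, 3, 0],
]
weights_example = [1, 1, 1, 1]
groups_example = {"A": [0], "B": [1], "C": [2], "D": [3]}

print(step7_merge_once(t_example, weights_example, groups_example, "A",
                       w_max=2, pin_limit=3))
```

After a successful merge the caller would put the enlarged group back among the candidates and choose the next target, as the text describes.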

Step 8 has only groups which cannot be merged with another to form one
chip. If the connection limit is not exceeded by a group, then this group
is assigned to a partition unit. If the connection limit is exceeded, the
group must be implemented with a bit slice partition, and Step 7 can be
repeated with the connection limit nearly doubled (to allow some pins for
duplicated control) and W_MAX doubled before bit slicing.
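
The decision in Step 8 can be written as a short sketch. The parameter `control_pins` is a hypothetical allowance for the duplicated control signals; the text says only that the connection limit is nearly doubled, so the exact amount subtracted here is an assumption, as are the function name and the returned dictionary.

```python
def step8_plan(connections, pin_limit, w_max, control_pins):
    """Decide how a group that cannot be merged further is implemented.

    Returns the action and, for a bit slice, the relaxed limits with which
    Step 7 could be repeated before slicing.
    """
    if connections <= pin_limit:
        return {"action": "assign"}
    return {
        "action": "bit-slice",
        # Assumption: "nearly doubled" means doubled minus the control pins.
        "pin_limit": 2 * pin_limit - control_pins,
        "w_max": 2 * w_max,
    }


print(step8_plan(connections=30, pin_limit=40, w_max=3, control_pins=4))
print(step8_plan(connections=50, pin_limit=40, w_max=3, control_pins=4))
```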

The final partition found for the example is shown in Table B.1.


Table B.1. Final Register Partition

    Partition unit      Vertices
    P_1                 1,8
    P_2                 6,7,9,10,11
    P_3                 3
    P_4                 4,5
    (not assigned)      2


VITA 

William Joseph Kaminsky, Jr., was born in East Chicago, Indiana
on September 18, 1951. He received the B.S. degree in Engineering from
Purdue University C.C. in 1973, where he was selected as the Outstanding
Student in the School of Engineering in 1973. He received the M.S. degree
in Engineering from the University of Illinois in 1974. He held engineering
positions with the Northern Indiana Public Service Co. from 1970 to 1973,
Motorola, Inc. in 1974, and Lawrence Livermore Laboratory in 1975. At the
University of Illinois he was employed as a Research Assistant in the
Radio Location Laboratory from 1973 to 1974, a Teaching Assistant in the
Department of Electrical Engineering from 1974 to 1975, and a Research
Assistant at the Coordinated Science Laboratory from 1976 to 1977.


